Aspen Mesh 1.3

Announcing Aspen Mesh 1.3

We’re excited to announce the release of Aspen Mesh 1.3, which is based on Istio’s latest LTS release, 1.3 (specifically tag 1.3.3). This release builds on our self-managed 1.2 series and includes all the new capabilities added by the Istio community in release 1.3, plus a host of new Aspen Mesh features, all fully tested and backed by production-grade support, ready for enterprise adoption.

The theme for the Aspen Mesh and Istio 1.3 releases was enhanced user experience. The release includes a user dashboard that has been redesigned for easier navigation of the service graph and cluster resources. The Aspen Mesh service graph view has been augmented to include ingress and egress services, as well as easier access to health and policy details for nodes on the graph. While a service graph is a great tool for visualizing service communication as a team, we realized that in order to quickly identify services that are experiencing problems, individual platform engineers need a view that lets them dig deeper and gain additional insight into their services. To address this, we are releasing a new table view which provides access to additional information about clusters, namespaces and workloads, including the ingress and egress services they are communicating with and any warnings or errors for those objects as detected by our open source configuration analyzer, Istio Vet.


The Istio community added new capabilities that make it easier for users to adopt and debug Istio, and reduced the configuration needed to get a service mesh working in a Kubernetes environment. The full list of features and enhancements can be found in Istio’s release announcement, but a few features deserve deeper analysis.

Specifying Container Ports Is No Longer Required

Before release 1.3, Istio only intercepted inbound traffic on ports that were explicitly declared as part of the container spec in Kubernetes. This was often a cause of friction for adoption, as Kubernetes doesn’t require container ports to be specified and by default forwards traffic to any unlisted port. Making this even worse, any unlisted inbound port bypassed the sidecar proxy (instead of being blocked), which created a potential security risk, as bypassing the proxy meant no policies were being enforced. In this release, specifying container ports is no longer required: by default, all ports are intercepted and redirected to the sidecar proxy, which means misconfiguration will no longer lead to security violations! If for some reason you would still like to explicitly specify inbound ports instead of capturing all of them (which we highly recommend), you can use the annotation “traffic.sidecar.istio.io/includeInboundPorts” on the pod spec.
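As a sketch of how that annotation is applied (the annotation name comes from the release; the pod name, image and port values below are placeholders):

apiVersion: v1
kind: Pod
metadata:
  name: myapp                    # placeholder pod name
  annotations:
    # Only intercept inbound traffic on these ports; any other port
    # bypasses the sidecar. Omit the annotation to capture all ports
    # (the new default, and the behavior we recommend).
    traffic.sidecar.istio.io/includeInboundPorts: "8080,8443"
spec:
  containers:
  - name: myapp
    image: example/myapp:latest  # placeholder image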

Protocol Detection

In earlier versions of Istio, all service port names had to be explicitly prefixed with the protocol being used by that port (http-, grpc-, tcp-, etc.). In the absence of a prefix, traffic was classified as TCP, which meant a loss in visibility (metrics/tracing). It was also possible to bypass policy if a user had configured HTTP or Layer 7 policies thinking the application was speaking Layer 7 while the mesh was classifying it as TCP traffic. Experienced users of Kubernetes who already had a lot of existing configuration had to migrate their service definitions to add this prefix, which led to a lot of missing configuration and adoption burden. In release 1.3, an experimental protocol detection feature was added which doesn’t require users to prefix the service port name for HTTP traffic. Note that this feature is experimental and only works for HTTP traffic; for all other protocols you still need to add the prefix on the port names. Protocol detection can reduce configuration burden for users, but it can interact with policies and routing in unexpected ways. We are working with the Istio community to iron out these interactions and will be publishing a blog soon on recommended practices for production usage. In the meantime, this feature is disabled by default in the Aspen Mesh release and we encourage our customers to enable it only in staging environments. Additionally, for Aspen Mesh customers, we automatically run the service port prefix vetter and notify you if any service in the mesh has ports with missing protocol prefixes.
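For reference, the explicit prefixes look like this on a Service; the port names and numbers below are illustrative:

apiVersion: v1
kind: Service
metadata:
  name: productpage
spec:
  selector:
    app: productpage
  ports:
  - name: http-web   # "http-" prefix declares HTTP, so the mesh emits L7 telemetry
    port: 9080
  - name: grpc-api   # "grpc-" prefix for gRPC; unprefixed ports are treated as TCP
    port: 9081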

Mixer-less Telemetry

Earlier versions of Istio had a control plane component, Mixer, which was responsible for receiving attributes about traffic from the sidecar proxies of client and server workloads and exposing them to a telemetry backend system like Prometheus or DataDog. This architecture provided a nice abstraction layer for operators to swap out telemetry backends, but Mixer often became a choke point that required a large amount of resources (CPU/memory), making Istio expensive for operators to manage. In this release, an experimental feature was added which doesn’t require running Mixer to capture telemetry: the sidecar proxies expose the metrics directly, which can be scraped by Prometheus. This feature is disabled by default and under active development to make sure users get the same metrics with and without Mixer. This page documents how to enable and use this feature if you’re interested in trying it out.
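If you do try it, the sidecars expose Envoy’s Prometheus endpoint directly (port 15090, path /stats/prometheus, in the releases we’ve looked at; treat these as assumptions and confirm against the docs linked above). A minimal Prometheus scrape job might look like:

scrape_configs:
- job_name: envoy-stats            # scrape sidecars directly, no Mixer in the path
  metrics_path: /stats/prometheus  # Envoy's Prometheus stats endpoint
  kubernetes_sd_configs:
  - role: pod
  relabel_configs:
  # keep only the sidecar's stats port (named *-envoy-prom, 15090 by convention)
  - source_labels: [__meta_kubernetes_pod_container_port_name]
    action: keep
    regex: '.*-envoy-prom'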

Telemetry for External Services

Depending on your global settings (i.e., whether to allow any external service access or to block all traffic without explicit ServiceEntries), there were gaps in telemetry when external traffic was either blocked or allowed. Visibility into external services is one of the key benefits of a service mesh, and the new functionality added in release 1.3 allows you to monitor all external service traffic in either mode. This was a highly requested feature from both our customers and other production users of Istio, and we were pleased to contribute this functionality to open source Istio. This blog documents how the augmented metrics can be used to better understand external service access. Note that all Aspen Mesh releases by default block all external service access unless it is explicitly declared via ServiceEntries, which we recommend.
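As an illustration of the block-by-default mode, external access is declared explicitly with a ServiceEntry along these lines (the name and host are hypothetical); the same entry gives the mesh a stable name to attach external-service metrics to:

apiVersion: networking.istio.io/v1alpha3
kind: ServiceEntry
metadata:
  name: external-api      # hypothetical name
spec:
  hosts:
  - api.example.com       # illustrative external host
  location: MESH_EXTERNAL # this traffic leaves the mesh
  resolution: DNS
  ports:
  - number: 443
    name: https
    protocol: TLS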

We hope these new features simplify the configuration needed to adopt Aspen Mesh, and that the enhanced user experience makes it easy for you to navigate the complexities of a microservices environment. You can get the latest release here, or if you’re an existing customer, please follow the upgrade instructions in our documentation to switch to this version.

 


On Silly Animals and Gray Codes

I love Information Theory. This is a random rumination on surprise.  

Helm (v2) is a templating engine and release manager for Kubernetes.  Basically it lets you leverage the combined knowledge of experts on how you should configure container software, but still gives you nerd knobs you can tweak as needed. When Helm deploys software, it's called a release. You can name your releases, like ingress-controller-for-prod.  You'll use this name later: "Hey, Helm, how is ingress-controller-for-prod doing?" or "Hey, Helm, delete all the stuff you made for ingress-controller-for-prod."

If you don't name a release, Helm will make up a release name for you. It's a combination of an adjective and an animal:

"Monicker ships with a couple of word lists that were written and approved by a group of giggling school children (and their dad). We built a lighthearted list based on animals and descriptive words (mostly adjectives)."

So if you don't pick a name, Helm will pick one for you. You might get jaunty ferret or gauche octopus. Helm could have decided to pick unique identifiers, say UUIDs, so instead of jaunty ferret you get 9fa485b1-6e8b-47c4-baa1-3923394382a5 or e0c2def3-bc94-44ff-b702-985d4eb38ded. To Helm itself, the UUIDs would be fine. To the humans, though, I argue 9fa485b1-6e8b-47c4-baa1-3923394382a5 is a bad option because our brains aren't good handlers of long strings like 9fa485b1-6e8b-47c4-baa1-3932394382a5; it's hard to say 9fa485b1-6e8b-47e4-baa1-3923394382a5 and you're not even going to notice that I've actually subtly mixed up digits in 9fa485b1-6e8b-47c4-baa1-3923393482a5 through this entire paragraph.  But if I had mixed up jaunty ferret and jumpy ferret you at least stand a chance. This is true even though the bitwise difference between the inputs that generated jaunty ferret and jumpy ferret is actually smaller than my UUID tricks.

Humans are awful at handling arbitrarily long numbers. We can't fake them well. We get dazzled by them. We are miserable at comparing even short numbers; sometimes people die as a result.

So, if you're building identifiers into a system, you should consider if those are going to be seen by humans. And if so, I think you should make those identifiers suitable for humans: distinctive and pronounceable.

I've seen this used elsewhere; Docker does it for container names (but with scientists and hackers instead of animals). Netlify and GitHub will do it for project names. LastPass has a "Pronounceable" option, and pwgen walks a fine line; it explicitly trades a little entropy to avoid users "simply writ[ing] the password on a piece of paper taped to the monitor..." in the hell that is modern user/password management. I've also worked with a respected support organization that does this for customer issues (and all the humans seemed massively more effective IMing/emailing/Wiki-writing/chatting in the hall about names instead of 10-digit numbers).

Aspen Mesh does this in a few places. The first benefit is some great GIFs. On our team, if Randy asks you to fix something in the "singing clams" object, he'll Slack you a GIF as well. The second benefit is distinctiveness - after you've seen a GIF of singing clams, the likelihood you accidentally delete the boasting aardvark object is basically nil. The likelihood that your dreams are haunted by singing clams is an entirely different concern.


So I argue that replacing numbers with pronounceable and memorable human-language identifiers is great when we need things to be distinguishable and possible to remember. Humans are too easily tricked by subtle changes in long numbers.

An added bonus that we enjoy is that we bring some of our most meaningful cluster names to life at Aspen Mesh. Our first development cluster, our first production cluster and our first customer cluster all have a special place in our hearts. Naturally, we took those cluster names and made them into Aspen Mesh mascots:

  • jaunty-ferret
  • gauche-octopus
  • jolly-bat

Our cluster names make it easier for us to get development work done, and come with the added bonus of making the office more fun. If you want a set of these awesome cluster animals, leave a comment or tweet us @AspenMesh and we’ll send you a sticker pack. 


Aspen Mesh 1.2.7 Security Update

Aspen Mesh is announcing the release of 1.2.7, which addresses important Istio security updates. Below are the details of the security fixes, taken from the Istio 1.2.7 security update.

Security Update 

ISTIO-SECURITY-2019-005: A DoS vulnerability has been discovered by the Envoy community. 

  • CVE-2019-15226: After investigation, the Istio team has found that this issue could be leveraged for a DoS attack in Istio if an attacker uses a high quantity of very small headers.

Bug Fix

  • Fix a bug where the nodeagent was failing to start when using Citadel (Issue 15876)

Additionally, the Aspen Mesh 1.2.7 release contains bug fixes and enhancements from Istio release 1.2.6.

The Aspen Mesh 1.2.7 binaries are available for download here.

For upgrade procedures for Aspen Mesh deployments installed via Helm (helm upgrade), please visit our Getting Started page.


How to Debug Istio Mutual TLS (mTLS) Policy Issues Using Aspen Mesh

Users Care About Secure Service to Service Communication

Mutual TLS (mTLS) communication between services is a key Istio feature driving adoption, as applications do not have to be altered to support it. mTLS provides client- and server-side security for service to service communication, enabling organizations to enhance network security with reduced operational burden (e.g. certificate management is handled by Istio). If you are interested in learning more, check out Istio's mTLS docs here. For reasons ranging from regulatory concerns to auditing requirements, businesses need to demonstrate they are following burgeoning security practices in a microservices landscape.

Many techniques, from IPSec to a wide range of other solutions, have evolved to help meet this requirement and let businesses focus on business value. Unfortunately, many of them require expertise to develop or configure properly. Unless you are a security expert, it is challenging to implement these techniques correctly: managing ciphers and algorithms, rotating keys and certificates, and updating system libraries when CVEs are found is difficult for software developers, DevOps engineers and sysadmins to keep abreast of. Even seasoned security professionals can find it difficult to implement and audit such systems. Since security is a core feature, this is where a service mesh like Aspen Mesh can help. A service mesh aims to drastically lessen the burden of securing and auditing such systems, enabling users to focus on their core products.

Gradually Adopting mTLS Within Istio

At Aspen Mesh we recommend installing Istio with global mTLS enabled. However, very few deployments of Istio are in green-field environments where services are created gradually and can be monitored independently before new services are rolled out. In most cases, users will adopt mTLS gradually, service by service, and carefully monitor traffic behavior before proceeding to the next service.

A common problem that many users experience when enabling mTLS for service communication in their service mesh is inadvertently breaking traffic. A misconfigured AuthenticationPolicy or DestinationRule can affect communication unbeknownst to a user until other issues arise.

It is difficult to monitor for such specific failures because they occur at the transport layer (L4), where a raw TCP connection is first established by the underlying OS and the TLS handshake then takes place. If a problem happens during this handshake, the Envoy sidecar is not able to produce detailed diagnostic metrics and messages, as the error is not at the application layer (L7). While 503 errors can surface due to misconfiguration, a 503 alone is not specific enough to determine whether the issue is due to misconfiguration or a misbehaving service. We are working with the Istio community to add telemetry for traffic failures related to mTLS misconfiguration. This requires surfacing the relevant information from Envoy, which we are collaborating on in this pull request. Until such capability exists, there are techniques and tools, discussed below, to aid you in debugging traffic management issues.

At Aspen Mesh we want our users to feel confident in their ability to manage their infrastructure. Kubernetes, Istio and Aspen Mesh are the platform, but business value is derived from software written and configured in-house, so quickly resolving issues is paramount to our customers' success.

Debugging Policy Issues With Aspen Mesh

We will now walk through debugging policy issues using Aspen Mesh. Many of the following techniques also apply to plain Istio if you don't have Aspen Mesh installed.

In the example below, bookinfo was installed into the bookinfo namespace using Aspen Mesh with global mTLS set to PERMISSIVE. We then created three deployments, spanning three different namespaces, that communicated with the productpage service.

A namespace-wide policy was created to set mTLS to STRICT. However, no DestinationRules were created, and as a result the system started to experience mTLS errors.
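The namespace-wide policy looked roughly like this (a minimal sketch using the v1alpha1 authentication API current as of Istio 1.3):

apiVersion: authentication.istio.io/v1alpha1
kind: Policy
metadata:
  name: default      # the name "default" makes the policy namespace-wide
  namespace: bookinfo
spec:
  peers:
  - mtls:
      mode: STRICT   # servers now require mTLS, but clients were never told to send it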

The service graph suggests that there is a problem with the traffic generator communicating with productpage. We will first inspect the policy settings and logs of our services.

Determining the DestinationRule for a workload and an associated service is pretty straightforward.

$ istioctl authn tls-check -n bookinfo traffic-generator-productpage-6b88d69f-xxfkn productpage.bookinfo.svc.cluster.local

where traffic-generator-productpage-6b88d69f-xxfkn is the name of a pod within the bookinfo namespace and productpage.bookinfo.svc.cluster.local is the server. The output will be similar to the following:

HOST:PORT                                       STATUS       SERVER     CLIENT     AUTHN POLICY         DESTINATION RULE
productpage.bookinfo.svc.cluster.local:9080     CONFLICT     mTLS       HTTP       default/bookinfo     destrule-productpage/bookinfo

If no conflict is found, the STATUS column will say OK, but in this example a conflict exists between the AuthenticationPolicy and the DestinationRule. Inspecting the output closely, we see that a namespace-wide AuthenticationPolicy is in use--determined by its name of default--along with what appears, by name, to be a host-specific DestinationRule.

Using kubectl we can directly inspect the contents of the DestinationRule:

$ kubectl get destinationrule -n bookinfo destrule-productpage -o yaml
apiVersion: networking.istio.io/v1alpha3
kind: DestinationRule
metadata:
  annotations:
    kubectl.kubernetes.io/last-applied-configuration: |
      {"apiVersion":"networking.istio.io/v1alpha3","kind":"DestinationRule","metadata":{"annotations":{},"name":"destrule-productpage","namespace":"bookinfo"},"spec":{"exportTo":["*"],"host":"productpage","trafficPolicy":{"tls":{"mode":"DISABLE"}}}}
  creationTimestamp: "2019-10-10T20:24:48Z"
  generation: 1
  name: destrule-productpage
  namespace: bookinfo
  resourceVersion: "4874298"
  selfLink: /apis/networking.istio.io/v1alpha3/namespaces/bookinfo/destinationrules/destrule-productpage
  uid: 01612af7-eb9c-11e9-a719-06457fb661c2
spec:
  exportTo:
  - '*'
  host: productpage
  trafficPolicy:
    tls:
      mode: DISABLE

A conflict does exist, and we can fix it by altering our DestinationRule to have a mode of ISTIO_MUTUAL instead of DISABLE. In this example it was a fairly simple fix. At times, however, you may see a DestinationRule that is different from the one you expect; reasoning about the correct DestinationRule object is difficult without first knowing the resolution hierarchy established by Istio. In our example, the DestinationRule above also applies to the traffic-generator workloads in the other namespaces.
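Based on the rule shown above, the corrected DestinationRule keeps everything else the same and changes only the TLS mode:

apiVersion: networking.istio.io/v1alpha3
kind: DestinationRule
metadata:
  name: destrule-productpage
  namespace: bookinfo
spec:
  exportTo:
  - '*'
  host: productpage
  trafficPolicy:
    tls:
      mode: ISTIO_MUTUAL   # clients now originate mTLS, matching the STRICT policy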

DestinationRule Hierarchy Resolution

When Istio configures the sidecars for service to service communication, it must make a determination on which DestinationRule, if any, should be used to handle communication between each service. When a client attempts to contact a server, the client's request is first routed to its sidecar and that sidecar inspects its configuration to determine the method by which it should establish a communication with the server's sidecar.

The rules by which Istio creates these sidecar configurations are as follows: the client first looks for a DestinationRule in its own namespace that matches the FQDN of the requested server. If no DestinationRule is found there, the server's namespace is checked; if still none is found, the Istio root namespace (istio-system by default) is checked for a matching DestinationRule.
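To make this concrete, suppose both of the hypothetical rules below exist. A client in the traffic-generator namespace calling productpage uses the first rule, because its own namespace is checked before the server's:

apiVersion: networking.istio.io/v1alpha3
kind: DestinationRule
metadata:
  name: client-side            # hypothetical; wins for clients in this namespace
  namespace: traffic-generator
spec:
  host: productpage.bookinfo.svc.cluster.local
  trafficPolicy:
    tls:
      mode: ISTIO_MUTUAL
---
apiVersion: networking.istio.io/v1alpha3
kind: DestinationRule
metadata:
  name: server-side            # hypothetical; used only when the client has no local match
  namespace: bookinfo
spec:
  host: productpage
  trafficPolicy:
    tls:
      mode: DISABLE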

DestinationRules that use wildcards or specific ports, or that make use of exportTo, can make it even more arduous to determine DestinationRule resolution. Istio has a set of guidelines to help users adopt rule changes, found here.

It is also worth noting that when a new DestinationRule is created to adhere to an AuthenticationPolicy change, it is important to keep any previously applied traffic rules; otherwise you may see a behavioral change in service communication within your system. For instance, if load balancing was previously LEAST_CONN for service to service communication due to a client-namespace DestinationRule targeting another namespace, then the new DestinationRule should inherit the load balancing setting. Otherwise, load balancing for that service will revert to the default, ROUND_ROBIN, and the user will see a behavioral change in traffic patterns within the service mesh.
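For instance, a sketch of a new rule that adds mTLS without dropping an existing load balancing choice:

apiVersion: networking.istio.io/v1alpha3
kind: DestinationRule
metadata:
  name: productpage-mtls   # hypothetical name
  namespace: bookinfo
spec:
  host: productpage
  trafficPolicy:
    loadBalancer:
      simple: LEAST_CONN   # carried over from the previous rule
    tls:
      mode: ISTIO_MUTUAL   # the new mTLS setting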

Our product helps simplify this by respecting the rules set by Istio in 1.1.0+ and inspecting existing AuthenticationPolicies and DestinationRules when creating new ones.

Even so, it is best to use the fewest number of DestinationRules possible in a service mesh. While it is an incredibly powerful feature, it's best used with discretion and intent.

Debugging Traffic Issues

Besides globally enabling mTLS and setting the outbound traffic policy to be more restrictive, we also recommend setting global.proxy.accessLogFile to /dev/stdout instead of /dev/null. This lets you view the access logs from the Envoy sidecars within your cluster when debugging Istio configuration and policy issues.
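For Helm-based installs, this is a one-line values override (shown here as a values.yaml fragment; adapt it to your install method):

global:
  proxy:
    # Send Envoy access logs to the container's stdout so they appear in
    # `kubectl logs`; some profiles default this to /dev/null.
    accessLogFile: "/dev/stdout"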

After applying an AuthenticationPolicy or a DestinationRule, it is possible that 503 HTTP status codes will start appearing. Here are a couple of checks to help you diagnose whether the problem is related to mTLS.

First, we will repeat what we did above, with <PODNAME> being the pod that is seeing the 503 HTTP status return codes:

$ istioctl authn tls-check <PODNAME> <DESTINATION SERVICE FQDN FORMAT>

In most cases this will be all of the debugging you have to do. However, we can also dig deeper to understand the issue, and it never hurts to know more about the underlying infrastructure of your system.

Remember that in a distributed system, changes may take a while to propagate, and that Pilot and Mixer are responsible for passing configuration and enforcing policy, respectively. Let's start by looking at the logs and configuration of the sidecars.

By enabling proxy access logs we can view them directly:

$ kubectl logs -n <POD NAMESPACE> <PODNAME> -c istio-proxy 

where you may see logs similar to the following:

[2019-10-07T21:54:37.175Z] "GET /productpage HTTP/1.1" 503 UC "-" "-" 0 95 1 - "-" "curl/7.35.0" "819c2e8b-ddad-4579-8508-794ab7de5a55" "productpage:9080" "XXX.XXX.XXX.XXX:9080" outbound|9080||productpage.bookinfo.svc.cluster.local - XXX.XXX.XXX.XXX:9080 XXX.XXX.XXX.XXX:33834 -
[2019-10-07T21:54:38.188Z] "GET /productpage HTTP/1.1" 503 UC "-" "-" 0 95 1 - "-" "curl/7.35.0" "290b42e7-5140-4881-ae87-778b352adcad" "productpage:9080" "XXX.XXX.XXX.XXX:9080" outbound|9080||productpage.bookinfo.svc.cluster.local - XXX.XXX.XXX.XXX:9080 XXX.XXX.XXX.XXX:33840 -

It is important to note the 503 UC in the above access logs. According to Envoy's documentation, the UC response flag means "Upstream connection termination in addition to 503 response code." This helps us understand that the failure is likely an mTLS issue.

If the containers inside of your service mesh contain curl (or equivalent) you can also run the following command within a pod that is experiencing 503s:

$ kubectl exec -c <CONTAINER> <PODNAME> -it -- curl -vv http://<DESTINATION SERVICE FQDN>:<PORT>

which may then output something akin to

* Rebuilt URL to: http://productpage.bookinfo.svc.cluster.local:9080/
* Hostname was NOT found in DNS cache
*   Trying XXX.XXX.XXX.XXX...
* Connected to productpage.bookinfo.svc.cluster.local (XXX.XXX.XXX.XXX) port 9080 (#0)
> GET / HTTP/1.1
> User-Agent: curl/7.35.0
> Host: productpage.bookinfo.svc.cluster.local:9080
> Accept: */*
>
< HTTP/1.1 503 Service Unavailable
< content-length: 95
< content-type: text/plain
< date: Mon, 07 Oct 2019 22:09:23 GMT
* Server envoy is not blacklisted
< server: envoy
<
* Connection #0 to host productpage.bookinfo.svc.cluster.local left intact
upstream connect error or disconnect/reset before headers. reset reason: connection termination

The last line is what's important: the HTTP headers could not be sent before the underlying TCP connection was terminated. This is a very strong indication that the TLS handshake failed.

And lastly, you can inspect the configuration sent by Pilot to your pod's sidecar using istioctl.

$ istioctl proxy-config cluster -n <POD NAMESPACE> <PODNAME> -o json

If you search the output for the destination service name, you will see an embedded metadata JSON element that names the specific DestinationRule the pod is currently using to communicate with that service.

{
    "metadata": {
      "filterMetadata": {
        "istio": {
          "config": "/apis/networking/v1alpha3/namespaces/traffic-generator/destination-rule/named-destrule"
        }
      }
    }
}

If you look closely at the returned object, you can also inspect and verify the rules being applied. The source of truth at any given moment is always your pod's Envoy sidecar configuration, so while you don't need to become an expert in all the nuances of debugging Istio, this is another tool in your debugging toolbelt.

The Future

Istio is an incredibly sophisticated and powerful tool. Like other such tools, it requires expertise to get the most out of it, but the rewards are greater than the challenge. Aspen Mesh is committed to enabling Istio and our customers to succeed. As our platform matures, we will continue to help users by surfacing use cases and examples like the service graph above, along with further in-depth ways to diagnose and troubleshoot issues. Lowering the mean time to detect (MTTD) and mean time to resolve (MTTR) for our users is critical to their success.

There are some exciting things that Aspen Mesh is planning to help our users tackle some of the hurdles we've found when adopting Istio. Keep an eye on our blog for future announcements.