Recommended Remediation for Kubernetes CVE-2019-11247

A Kubernetes vulnerability CVE-2019-11247 was announced this week.  While this is a vulnerability in Kubernetes itself, and not Istio, it may affect Aspen Mesh users. This blog will help you understand possible consequences and how to remediate.

  • You should mitigate this vulnerability by updating to Kubernetes 1.13.9, 1.14.5 or 1.15.2 as soon as possible.
  • This vulnerability affects the interaction of Roles (a Kubernetes RBAC resource) and CustomResources (Istio uses CustomResources for things like VirtualServices, Gateways, DestinationRules and more).
  • Only certain definitions of Roles are vulnerable.
  • Aspen Mesh's installation does not define any such Roles, but you might have defined them for other software in your cluster.
  • More explanation and recovery details below.

Explanation

If you have a Role defined anywhere in your cluster with a "*" for resources or apiGroups, then anything that can use that Role could escalate to modify many CustomResources.

This Kubernetes issue and this helpful blog have extensive details and an example walkthrough that we'll summarize here.  Kubernetes Roles define sets of permissions for resources in one particular namespace (like "default").  They are not supposed to define permissions for resources in other namespaces or resources that live globally (outside of any namespace); that's what ClusterRoles are for.

Here's an example Role from Aspen Mesh that says "The IngressGateway should be able to get, watch or list any secrets in the istio-system namespace", which it needs to bootstrap secret discovery to get keys for TLS or mTLS:

apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  name: istio-ingressgateway-sds
  namespace: istio-system
rules:
- apiGroups: [""]
  resources: ["secrets"]
  verbs: ["get", "watch", "list"]

This Role does not grant permissions to secrets in any other namespace, or grant permissions to anything aside from secrets.   You'd need additional Roles and RoleBindings to add those. It doesn't grant permissions to get or modify cluster-wide resources.  You'd need ClusterRoles and ClusterRoleBindings to add those.

The vulnerability is that if a Role grants access to CustomResources in one namespace, it accidentally grants access to the same kinds of CustomResources that exist at global scope.  This is not exploitable in many cases: if you have a namespace-scoped CustomResource called "Foo", you can't also have a global CustomResource called "Foo", so there are no global-scope Foos to attack.  Unfortunately, if your Role allows access to resource "*" or apiGroup "*", then in vulnerable versions of Kubernetes the "*" matches both the namespace-scoped Foo and a globally-scoped CustomResource (call it "Bar"), and the Role could be used to attack Bar.

If you're really scrutinizing the above example, note that the apiGroup "" is different than the apiGroup "*": the empty "" refers to core Kubernetes resources like Secrets, while "*" is a wildcard meaning any.
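
For contrast, here is a sketch of the kind of Role that would be vulnerable. The name and namespace are made up for illustration; any Role with "*" in resources or apiGroups fits the pattern:

apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  name: overly-broad-example  # hypothetical - do not use
  namespace: default
rules:
- apiGroups: ["*"]
  resources: ["*"]
  verbs: ["get", "list", "watch", "create", "update", "patch", "delete"]

In a vulnerable cluster, a subject bound to this Role in the "default" namespace could also read and modify globally-scoped CustomResources like MeshPolicy.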

Aspen Mesh and Istio define three globally-scoped CustomResources: ClusterRbacConfig, MeshPolicy, ClusterIssuer.  If you had a vulnerable Role defined and an attacker could assume that Role, then they could have modified those three CustomResources.  Aspen Mesh does not provide any vulnerable Roles so a cluster would need to have those Roles defined for some other purpose.

Recovery

First and as soon as possible, you should upgrade Kubernetes to a non-vulnerable version: 1.13.9, 1.14.5 or 1.15.2.

When you define Roles and ClusterRoles, you should follow the Principle of Least Privilege and avoid granting access to "*" - always prefer listing the specific resources required.

You can examine your cluster for vulnerable Roles.  If any exist, or have existed, those could have been exploited by an attacker with knowledge of this vulnerability to modify Aspen Mesh configuration.  To mitigate this, recreate any CustomResources after upgrading to a non-vulnerable version of Kubernetes.

This snippet will print out any vulnerable Roles currently configured, but cannot tell you if any may have existed in the past.  It relies on the jq tool:

kubectl get role --all-namespaces -o json |jq '.items[] | select(.rules[].resources | index("*"))'
kubectl get role --all-namespaces -o json |jq '.items[] | select(.rules[].apiGroups | index("*"))'

This is not a vulnerability in Aspen Mesh, so there is no need to upgrade Aspen Mesh.


Simplifying Microservices Security with Incremental mTLS

Kubernetes removes much of the complexity and difficulty involved in managing and operating a microservices application architecture. Out of the box, Kubernetes gives you advanced application lifecycle management techniques like rolling upgrades, resiliency via pod replication, auto-scalers and disruption budgets, efficient resource utilization with advanced scheduling strategies and health checks like readiness and liveness probes. Kubernetes also sets up basic networking capabilities which allow you to easily discover new services getting added to your cluster (via DNS) and enables pod to pod communication with basic load balancing.

However, most of the networking capabilities provided by Kubernetes and its CNI providers are constrained to layers 3/4 of the OSI stack (networking and transport protocols like TCP/IP). This means that any advanced networking functionality (like retries or routing) which relies on higher layers, i.e. parsing application protocols like HTTP/gRPC (layer 7) or encrypting traffic between pods using TLS (layers 5/6), has to be baked into the application. Relying on your applications to enforce network security is fraught with landmines: it tightly couples your operations/security and development teams, and it adds more burden on your application developers to own complicated infrastructure code.

Let’s explore what it takes for applications to perform TLS encryption for all inbound and outbound traffic in a Kubernetes environment. In order to achieve TLS encryption, you need to establish trust between the parties involved in communication. To establish trust, you need to create and maintain some sort of PKI infrastructure which can generate certificates, revoke them and periodically refresh them. As an operator, you now need a mechanism to provide these certificates (maybe Kubernetes secrets?) to the running pods and update the pods when new certificates are minted. On the application side, you have to rely on OpenSSL (or its derivatives) to verify trust and encrypt traffic. The application development team needs to handle upgrading these libraries when CVE fixes and upgrades are released. On top of these complexities, compliance concerns may require you to support only a minimum TLS version and a subset of ciphers, which means creating and supporting even more configuration options in your applications. All of these challenges make it very hard for organizations to encrypt all pod network traffic on Kubernetes, whether it’s for compliance reasons or to achieve a zero trust network model.

This is the problem that a service mesh leveraging the sidecar proxy approach is designed to solve. The sidecar proxy can initiate a TLS handshake and encrypt traffic without requiring any changes or support from the applications. In this architecture, the application pod makes a request in plain text to another application running in the Kubernetes cluster which the sidecar proxy takes over and transparently upgrades to use mutual TLS. Additionally, the Istio control plane component Citadel handles creating workload identities using the SPIFFE specification to create and renew certificates and mount the appropriate certificates to the sidecars. This removes the burden of encrypting traffic from developers and operators.

Istio provides a rich set of tools to configure mutual TLS: you can enable it globally (on or off) for the entire cluster, or incrementally adopt it namespace by namespace or for a subset of services and their clients. This is where things get a little complicated. In order to correctly configure mTLS for one service, you need to configure an Authentication Policy for that service and the corresponding DestinationRules for its clients.
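
As a sketch of what that pair of resources looks like, here is roughly what you would write to enable mTLS for a single hypothetical service named "reviews" in the "default" namespace (your hosts, names and namespaces will differ):

apiVersion: authentication.istio.io/v1alpha1
kind: Policy
metadata:
  name: reviews-mtls
  namespace: default
spec:
  targets:
  - name: reviews
  peers:
  - mtls: {}
---
apiVersion: networking.istio.io/v1alpha3
kind: DestinationRule
metadata:
  name: reviews-mtls
  namespace: default
spec:
  host: reviews.default.svc.cluster.local
  trafficPolicy:
    tls:
      mode: ISTIO_MUTUAL

The Policy tells the server side to expect mTLS, and the DestinationRule tells clients in the mesh to originate mTLS when calling that host; both halves need to agree or traffic fails.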

Both the Authentication policy and Destination rule follow a complex set of precedence rules which must be accounted for when creating these configuration objects. For example, a namespace level Authentication policy overrides the mesh level global policy, a service level policy overrides the namespace level and a service port level policy overrides the service specific Authentication policy. Destination rules allow you to specify the client side configuration based on host names where the highest precedence is the Destination rule defined in the client namespace then the server namespace and finally the global default Destination rule. On top of that, if you have conflicting Authentication policies or Destination rules, the system behavior can be indeterminate. A mismatch in Authentication policy and Destination rule can lead to subtle traffic failures which are difficult to debug and diagnose. Aspen Mesh makes it easy to understand mTLS status and avoid any configuration errors.

Editing these complex configuration files in YAML can be tricky and only compounds the problem at hand. In order to simplify how you configure these resources and incrementally adopt mutual TLS in your environment, we are releasing a new feature which enables our customers to specify a service port (via APIs or UI) and their desired mTLS state (enabled or disabled). The Aspen Mesh platform automatically generates the correct set of configuration needed (Authentication policy and/or Destination rules) by inspecting the current state and configuration of your cluster. You can then view the generated YAML, edit it as needed, and store it in your CI system or apply it manually. This feature removes the hassle of learning complex Istio resources and their interaction patterns, and provides you with valid, non-conflicting and functional Istio configuration.

Many customers we talk to are midway through migrating to a microservices architecture or a Kubernetes environment. That results in a hybrid environment where some services are consumed by clients outside the mesh, or are deployed outside Kubernetes altogether, so those services require a different mTLS policy. Our hosted dashboard makes it easy to identify services and workloads which have mTLS turned on or off, and then create configuration using the workflow above to change the mTLS state as needed.

If you’re an existing customer, please upgrade your cluster to our latest release (Aspen Mesh 1.1.3-am2) and login to the dashboard to start using the new capabilities.

If you’re interested in learning about Aspen Mesh and incrementally adopting mTLS in your cluster, you can sign up for a beta account here.


To Multicluster, or Not to Multicluster: Solving Kubernetes Multicluster Challenges with a Service Mesh

If you are going to be running multiple clusters for dev and organizational reasons, it’s important to understand your requirements and decide whether you want to connect these in a multicluster environment and, if so, to understand various approaches and associated tradeoffs with each option.

Kubernetes has become the container orchestration standard, and many organizations are currently running multiple clusters. But while communication issues within clusters are largely solved, communication across clusters is still a major challenge for most organizations.

Service mesh helps to address multicluster challenges. Start by identifying what you want, then shift to how to get it. We recommend understanding your specific communication use case, identifying your goals, then creating an implementation plan.

Multicluster offers a number of benefits:

  • Single pane of glass
  • Unified trust domain
  • Independent fault domains
  • Intercluster traffic
  • Heterogeneous/non-flat network

Which can be achieved with various approaches:

  • Independent clusters
  • Common management
  • Cluster-aware service routing through gateways
  • Flat network
  • Split-horizon Endpoints Discovery Service (EDS)

If you have decided to multicluster, your next move is deciding the best implementation method and approach for your organization. A service mesh like Istio can help, and when used properly can make multicluster communication painless.

Read the full article here on InfoQ’s site.

 


Why Service Meshes, Orchestrators Are Do or Die for Cloud Native Deployments

The self-contained, ephemeral nature of microservices comes with some serious upside, but keeping track of every single one is a challenge, especially when trying to figure out how the rest are affected when a single microservice goes down. The end result is that if you’re operating or developing in a microservices architecture, there’s a good chance part of your days are spent wondering what the hell your services are up to.

With the adoption of microservices, problems also emerge due to the sheer number of services that exist in large systems. Problems like security, load balancing, monitoring and rate limiting that had to be solved once for a monolith, now have to be handled separately for each service.

The good news is that engineers love a good challenge. And almost as quickly as they are creating new problems with microservices, they are addressing those problems with emerging microservices tools and technology patterns. Maybe the emergence of microservices is just a smart play by engineers to ensure job security.

Today’s cloud native darling, Kubernetes, eases many of the challenges that come with microservices. Auto-scheduling, horizontal scaling and service discovery solve the majority of build-and-deploy problems you’ll encounter with microservices.

What Kubernetes leaves unsolved is a few key containerized application runtime issues. That’s where a service mesh steps in. Let’s take a look at what Kubernetes provides, and how Istio adds to Kubernetes to solve the microservices runtime issues.

Kubernetes Solves Build-and-Deploy Challenges


Kubernetes supports a microservice architecture by enabling developers to abstract away the functionality of a set of pods, and expose services to other developers through a well-defined API. Kubernetes enables L4 load balancing, but it doesn’t help with higher-level problems, such as L7 metrics, traffic splitting, rate limiting and circuit breaking.
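
To make one of those gaps concrete, here is a sketch of the kind of L7 traffic splitting a service mesh like Istio layers on top of Kubernetes: a VirtualService that routes 90% of requests to v1 and 10% to v2 of a hypothetical "reviews" service (it assumes a DestinationRule elsewhere defines the v1 and v2 subsets):

apiVersion: networking.istio.io/v1alpha3
kind: VirtualService
metadata:
  name: reviews-split   # hypothetical
spec:
  hosts:
  - reviews
  http:
  - route:
    - destination:
        host: reviews
        subset: v1
      weight: 90
    - destination:
        host: reviews
        subset: v2
      weight: 10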

Service Mesh Addresses Challenges of Managing Traffic at Runtime

Service mesh helps address many of the challenges that arise when your application is being consumed by the end user. Being able to monitor what services are communicating with each other, if those communications are secure and being able to control the service-to-service communication in your clusters are key to ensuring applications are running securely and resiliently.

Istio also provides a consistent view across a microservices architecture by generating uniform metrics throughout. It removes the need to reconcile different types of metrics emitted by various runtime agents, or add arbitrary agents to gather metrics for legacy un-instrumented apps. It adds a level of observability across your polyglot services and clusters that is unachievable at such a fine-grained level with any other tool.

Istio also adds a much deeper level of security. While Kubernetes only provides basic secret distribution and control-plane certificate management, Istio provides mTLS capabilities so you can encrypt on the wire traffic to ensure your service-to-service communications are secure.

A Match Made in Heaven

Pairing Kubernetes with a service mesh like Istio gives you the best of both worlds, and since Istio was made to run on Kubernetes, the two work together seamlessly. You can use Kubernetes to manage all of your build and deploy needs, while Istio takes care of the important runtime issues.

Kubernetes has matured to a point that most enterprises are using it for container orchestration. Currently, there are 74 CNCF-certified service providers — which is a testament to the fact that there is a large and growing market. I see Istio as an extension of Kubernetes and a next step to solving more challenges in what feels like a single package.

Already, Istio is quickly maturing and is starting to see more adoption in the enterprise. It’s likely that in 2019 we will see Istio emerge as the service mesh standard for enterprises in much the same way Kubernetes has emerged as the standard for container orchestration.


Running Stateful Apps with Service Mesh: Kubernetes Cassandra with Istio mTLS Enabled

Cassandra is a popular, heavy-load, highly performant, distributed NoSQL database.  It is fully integrated into many mainstay cloud and cloud-native architectures. At companies such as Netflix and Spotify, Cassandra clusters provide continuous availability, fault tolerance, resiliency and scalability.

Critical and sensitive data is sent to and from a Cassandra database.  When deployed in a Kubernetes environment, ensuring the data is secure and encrypted is a must.  Understanding data patterns and performance latencies across nodes becomes essential, as your Cassandra environment spans multiple datacenters and cloud vendors.

A service mesh provides service visibility, distributed tracing, and mTLS encryption.  

While it’s true Cassandra provides its own TLS encryption, one of the compelling features of Istio is the ability to uniformly administer mTLS for all of your services.  With a service mesh, you can set up an easy and consistent policy where Istio automatically manages the certificate rotation. Pulling Cassandra into a service mesh pairs the capabilities of the two technologies in a way that makes running stateful services much easier.

In this blog, I’ll cover the steps necessary to configure Istio with mTLS enabled in a Kubernetes Cassandra environment.  We’ve collected some information from the Istio community, did some testing ourselves and pieced together a workable solution.  One of the benefits you get with Aspen Mesh is our Istio expertise from running Istio in production for the past 18 months.  We are tightly engaged with the Istio community and continually testing and working out the kinks of upstream Istio. We’re here to help you with your service mesh path to production!

Let’s consider how Cassandra operates.  To achieve continuous availability, Cassandra uses a “ring” communication approach, meaning each node communicates continually with the other existing nodes. For node consensus, each node sends metadata to several other nodes through a protocol called Gossip.  The receiving nodes then “gossip” to all the additional nodes. This Gossip protocol is similar to a TCP three-way handshake, and all of the metadata, like heartbeat state, node status, location, etc., is messaged across nodes via IP address:port.

In a Kubernetes deployment, Cassandra nodes are deployed as StatefulSets to ensure the allocated number of Cassandra nodes are available at all times. Persistent volumes are associated with the Cassandra StatefulSets, and a headless service is created to ensure a stable network ID.  This allows Kubernetes to restart a pod on another node and transfer its state seamlessly to the new node.

Now, here’s where it gets tricky.  When implementing an Istio service mesh with mTLS enabled, the Envoy sidecar intercepts all of the traffic from the Cassandra nodes, verifies where it’s coming from, decrypts and sends the payload to the Cassandra pod through an internal loopback address.   The Cassandra nodes are all listening on their Pod IPs for gossip. However, Envoy is forwarding only to 127.0.0.1, where Cassandra isn't listening. Let’s walk through how to solve this issue.

Setting up the Mesh:

We used the cassandra:v13 image from the Google samples repo for our Kubernetes Cassandra environment. There are a few things you’ll need to ensure are included in the Cassandra manifest at deployment time.  The Cassandra Service needs to be headless (clusterIP: None), and it must expose the additional ports/port-names that Cassandra uses to communicate:

apiVersion: v1
kind: Service
metadata:
  labels:
    app: cassandra
  namespace: cassandra
  name: cassandra
spec:
  clusterIP: None
  ports:
  - name: tcp-client
    port: 9042
  - port: 7000
    name: tcp-intra-node
  - port: 7001
    name: tcp-tls-intra-node
  - port: 7199
    name: tcp-jmx
  selector:
    app: cassandra

The next step is to tell each Cassandra node to listen on the Envoy loopback address.

This image, by default, sets Cassandra’s listener to the Kubernetes Pod IP.  The listener address will need to be set to the localhost loopback address. This allows the Envoy sidecar to pass communication through to the Cassandra nodes.

To enable this, you will need to change the Cassandra config file, cassandra.yaml (or the mechanism your image uses to generate it).

We did this by adding a substitution to our Kubernetes Cassandra manifest based on the Istio bug:

apiVersion: apps/v1
kind: StatefulSet
metadata:
  namespace: cassandra
  name: cassandra
  labels:
    app: cassandra
spec:
  serviceName: cassandra
  replicas: 3
  selector:
    matchLabels:
      app: cassandra
  template:
    metadata:
      labels:
        app: cassandra
    spec:
      terminationGracePeriodSeconds: 1800
      containers:
      - name: cassandra
        image: gcr.io/google-samples/cassandra:v13
        command: [ "/usr/bin/dumb-init", "/bin/bash", "-c", "sed -i 's/^CASSANDRA_LISTEN_ADDRESS=.*/CASSANDRA_LISTEN_ADDRESS=\"127.0.0.1\"/' /run.sh && /run.sh" ]
        imagePullPolicy: Always
        ports:
        - containerPort: 7000
          name: intra-node
        - containerPort: 7001
          name: tls-intra-node
        - containerPort: 7199
          name: jmx
        - containerPort: 9042

This simple change uses sed to patch the Cassandra startup script so that Cassandra listens on localhost.

If you're not using the google-samples/cassandra container you should modify your Cassandra config or container to set the listen_address to 127.0.0.1.  For some containers, this may already be the default.
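
For reference, the equivalent change in a stock cassandra.yaml is a single setting (a sketch; your config layout and tooling may differ):

# cassandra.yaml
listen_address: 127.0.0.1

Some images instead read an environment variable; the google-samples image uses CASSANDRA_LISTEN_ADDRESS in its run.sh, which is what the sed command above patches.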

You'll need to remove any ServiceEntry or VirtualService resources associated with the Cassandra deployment, as no additional routing entries or rules are necessary.  Nothing external needs to communicate with the ring: Cassandra is now inside the mesh, and communication will simply pass through to each node.

Since the Cassandra Service is configured as a headless service (clusterIP: None), a DestinationRule does not need to be added.  When there is no clusterIP assigned, Istio defines the load balancing mode as PASSTHROUGH by default.

If you are using Aspen Mesh, the global meshpolicy has mTLS enabled by default, so no changes are necessary.

$ kubectl edit meshpolicy default -o yaml
apiVersion: authentication.istio.io/v1alpha1
kind: MeshPolicy
.
. #edited out
.
spec:
  peers:
  - mtls: {}

Finally, create a Cassandra namespace, enable automatic sidecar injection and deploy Cassandra.

$ kubectl create namespace cassandra
$ kubectl label namespace cassandra istio-injection=enabled
$ kubectl -n cassandra apply -f <Cassandra-manifest>.yaml

Here is the output that shows the Cassandra nodes running with Istio sidecars.

$ kubectl get pods -n cassandra                                                                                   
NAME                     READY     STATUS    RESTARTS   AGE
cassandra-0              2/2       Running   0          22m
cassandra-1              2/2       Running   0          21m
cassandra-2              2/2       Running   0          20m
cqlsh-5d648594cb-86rq9   2/2       Running   0          2h

Here is the output validating mTLS is enabled.

$ istioctl authn tls-check cassandra.cassandra.svc.cluster.local

 
HOST:PORT           STATUS     SERVER     CLIENT     AUTHN POLICY     DESTINATION RULE
cassandra...:7000       OK       mTLS       mTLS         default/ default/istio-system

Here is the output validating the Cassandra nodes are communicating with each other and able to establish load-balancing policies.

$ kubectl exec -it -n cassandra cassandra-0 -c cassandra -- nodetool status
Datacenter: DC1-K8Demo
======================
Status=Up/Down
|/ State=Normal/Leaving/Joining/Moving
--  Address       Load     Tokens  Owns (effective)  Host ID            Rack
UN  100.96.1.225  129.92 KiB  32   71.8%       f65e8c93-85d7-4b8b-ae82-66f26b36d5fd Rack1-K8Demo
UN  100.96.3.51   157.68 KiB  32   55.4%       57679164-f95f-45f2-a0d6-856c62874620  Rack1-K8Demo
UN  100.96.4.59   142.07 KiB  32   72.8%       cc4d56c7-9931-4a9b-8d6a-d7db8c4ea67b  Rack1-K8Demo

If this is a solution that can make things easier in your environment, sign up for the free Aspen Mesh Beta.  It will guide you through an automated Istio installation, then you can install Cassandra using the manifest covered in this blog, which can be found here.   


Advancing the promise of service mesh: Why I work at Aspen Mesh

The themes and content expressed are mine alone, with helpful insight and thoughts from my colleagues, and are about software development in a business setting.

I’ve been working at Aspen Mesh for a little over a month and during that time numerous people have asked me why I chose to work here, given the opportunities in Boulder and the Front Range.

To answer that question, I need to talk a bit about my background. I’ve been a professional software developer for about 13 years now. During that time I’ve primarily worked on the back-end for distributed systems and have seen numerous approaches to the same problems with various pros and cons. When I take a step back, though, a lot of the major issues that I’ve seen are related to deployment and configuration around service communication:
  • How do I add a new service to an already existing system?
  • How big, in scope, should these services be?
  • Do I use a message broker?
  • How do I handle discoverability, high availability and fault tolerance?
  • How, and in what format, should data be exchanged between services?
  • How do I audit the system when the system inevitably comes under scrutiny?

I’ve seen many different approaches to these problems. In fact, there are so many approaches, some orthogonal and some similar, that software developers can easily get lost. While the themes are constant, it is time consuming for developers to get up to speed with all of these technologies. There isn’t a single solution that solves every common problem seen in the backend; I’m sure the same applies to the front-end as well. It’s hard to truly understand the pros and cons of an approach until you have a working system; and when that happens and if you then realize that the cons outweigh the pros, it may be difficult and costly to get back to where you started (see sunk cost fallacy and opportunity cost). Conversely, analysis paralysis is also costly to an organization, both in terms of capital—software developers are not cheap—and an inability to quickly adapt to market pressures, be it customer needs and requirements or a competitor that is disrupting the market.

Yet the hype cycle continues. There is always a new shiny thing taking the software world by storm. You see it in discussions on languages, frameworks, databases, messaging protocols, architectures ad infinitum. Separating the wheat from the chaff is something developers must do to ensure they are able to meet their obligations. But with the signal-to-noise ratio being low at times and with looming deadlines, not all possibilities can be explored.

So as software developers, we have an obligation of due diligence and to be able to deliver software that provides customer value; that helps customers get their work done and doesn’t impede them, but enables them. Most customers don’t care about which languages you use or which databases you use or how you build your software or what software process methodology you adhere to, if any. They just want the software you provide to enable them to do their work. In fact, that sentiment is so strong that slogans have been made around it.

So what do customers care about, generally speaking? They care about access to their data: how they can view it, modify it and draw value from it. It should look and feel modern, but even that isn’t a strict requirement. It should be simple for a novice to use, yet provide enough advanced capability that your most advanced users can teach you something new about the tool you’ve created. This is information technology, after all. Technology for technology’s sake is not a useful outcome.

Any work that detracts from adding customer value needs to be deprioritized, as there is always more work to do than hours in the day. As developers, it’s our job to be knee deep in the weeds, so it’s easy to lose sight of that; unit testing, automation, language choice, cloud provider, software process methodology, etc. absolutely matter, but they are a means to an end.

With that in mind, let’s create a goal: application developers should be application developers.

Not DevOps engineers, or SREs or CSRs, or any other myriad of roles they are often asked to take on. I’ve seen my peers happiest when they are solving difficult problems and challenging themselves. Not when they are figuring out what magic configuration setting is breaking the platform. Command over their domain and the ability and permission to “fix it” is important to almost every appdev.

If developers are expensive to hire, train, replace and keep then they need to be enabled to do their job to the best of their ability. If a distributed, microservices platform has led your team to solving issues in the fashion of Sherlock Holmes solving his latest mystery, then perhaps you need a different approach.

Enter Istio and Aspen Mesh

It’s hard to know where the industry is with respect to the Hype Cycle for technologies like microservices, container orchestration, service mesh and a myriad of other choices; this isn’t an exact science where we can empirically take measurements. Most companies have older, but proven, systems built on LAMP or Java application servers or monoliths or applications that run on a big iron system. Those aren’t going away anytime soon, and developers will need to continue to support and add new features and capabilities to these applications.

Any new technology must provide a path for people to migrate their existing systems to something new.

If you have decided to move, or are moving, towards a microservice architecture, even if you have a monolith, implementing a service mesh should be among the possibilities explored. If you already have a microservice architecture that leverages gRPC or HTTP, and you're using Kubernetes, then the benefits of a service mesh can be quickly realized. It's easy to sign up for our beta, install Aspen Mesh and the sample bookinfo application, and see things in action. Once I did, I became a true believer. Not being coupled to a particular cloud provider, but being flexible and able to choose where and how things are deployed, empowers developers and companies to make their own choices.

Over the past month I’ve been able to quickly write application code and get it delivered faster than ever before; that is in large part due to the platform my colleagues have built on top of Kubernetes and Istio. I’ve been impressed by how easy a well built cloud-native architecture can make things, and learning more about where Aspen Mesh, Istio and Kubernetes are heading gives me confidence that community and adoption will continue to grow.

As someone that has dealt with distributed systems issues continuously throughout his career, I know managing and troubleshooting a distributed system can be exhausting. I just want to enable others, even Aspen Mesh as we dogfood our own software, to do their jobs. To enable developers to add value and solve difficult problems. To enable a company to monitor their systems, whether it be mission critical or a simple CRUD application, to help ensure high uptime and responsiveness. To enable systems to be easily auditable when the compliance personnel have GDPR, PCI DSS or HIPAA concerns. To enable developers to quickly diagnose issues within their own system, fix them and monitor the change. To enable developers to understand how their services are communicating with each other--if it’s an n-tier system or a spider’s web--and how requests propagate through their system.

The value of Istio and the benefits of Aspen Mesh in solving these challenges is what drew me here. The opportunities are abundant and fruitful. I get to program in go, in a SaaS environment and on a small team with a solid architecture. I am looking forward to becoming a part of the larger CNCF community. With microservices and cloud computing no longer being niche--which I’d argue hasn’t been the case for years--and with businesses adopting these new technology patterns quickly, I feel as if I made the right long-term career choice.


Top 3 Service Mesh Developments in 2019

Last year was about service mesh evaluation, trialing — and even hype.

While the interest in service mesh as a technology pattern was very high, it was mostly about evaluation and did not see widespread adoption. The capabilities service mesh can add to ease managing microservice-based applications at runtime are obvious, but the technology still needs to reach maturity before gaining widespread production adoption.

What we can say is service mesh adoption should evolve from the hype stage in a very real way this year.

What can we expect to see in 2019?

  1. The evolution and coalescing of service mesh as a technology pattern;
  2. The evolution of Istio as the way enterprises choose to implement service mesh;
  3. Clear use cases that lead to wider adoption.

The Evolution of Service Mesh

There are several architectural options when it comes to service mesh, but undoubtedly, the sidecar architecture will see the most widespread usage in 2019. The sidecar proxy as the architectural pattern, and more specifically, Envoy as the technology, have emerged as clear winners for how the majority will implement service mesh.

Considering control plane service meshes, we have seen the space coalesce around leveraging sidecar proxies. Linkerd, with its merging of Conduit and release of Linkerd 2, got on the sidecar train. And the original sidecar control plane mesh, Istio, certainly has the most momentum in the cloud native space. A look at the Istio GitHub repo shows:

  • 14,500 stars;
  • 6,400 commits;
  • 300 contributors.

And if these numbers don’t clearly demonstrate the momentum of the project, just consider the number of companies building around Istio:

  • Aspen Mesh;
  • Avi Networks;
  • Cisco;
  • OpenShift;
  • NGINX;
  • Rancher;
  • Tufin Orca;
  • Tigera;
  • Twistlock;
  • VMware.

The Evolution of Istio

So the big question is: where is the Istio project headed in 2019? I should start with the disclaimer that the following are all guesses; they are well-informed guesses, but guesses nonetheless.

Community Growth

Now that Istio has hit 1.0, the number of contributors outside the core Google and IBM team is starting to grow. I’d hazard a guess that Istio will be truly stable around 1.3, sometime in June or July. Once the project gets to the point that it is usable at scale in production, I think you’ll really see it take off.

Emerging Vendor Landscape

At Aspen Mesh, we placed our bets on Istio 18 months ago. It seems increasingly clear that Istio will win service mesh in much the same way Kubernetes has won container orchestration.

Istio is a powerful toolbox that directly addresses many microservices challenges that are being solved with multiple manual processes, or are not being solved at all. The power of the open source community surrounding it also seems to be a factor that will lead to widespread adoption. As this becomes clearer, the number of companies building on Istio and building Istio integrations will increase.

Istio Will Join the Cloud Native Computing Foundation

Total guess here, but I’d bet on this happening in 2019. CNCF has proven to be an effective steward of cloud-native open source projects. I think joining the CNCF will also drive widespread adoption, which in turn will be key to the long-term success of Istio. We shall see what the project founders decide, but this move will benefit everyone once the Istio project reaches the point where it makes sense for it to become a CNCF project.

Real-World Use Cases Are Key To Spreading Adoption

Service mesh is still a nascent market, and in the next 12-24 months we should see the market expand past just the early adopters. But for those who have been paying attention, the why of a service mesh has largely been answered. The why is also certain to evolve, but for now, the reasons to implement a service mesh are clear. I think that large parts of the how are falling into place, but more will emerge as service mesh encounters real-world use cases in 2019.

I think what remains unanswered is: “what are the real-world benefits I am going to see when I put this into practice?” This is not a new question around an emerging technology. Neither will the way this question gets answered be anything new: it will be answered through use cases. I can’t emphasize enough how use cases based on actual users will be key.

Service mesh is a powerful toolbox, but only a small swath of users will care about how cool the tech is. The rest will want to know what problems it solves.

I predict 2019 will be the year of service mesh use cases that will naturally emerge as the number of adopters increases and begins to talk about the value they are getting with a service mesh.

Some Final Thoughts

If you are already using a service mesh, you understand the value it brings. If you’re considering a service mesh, pay close attention to this space; the growing number of use cases will make the real-world value proposition more clear. And if you’re not yet decided on whether or not you need a service mesh, check out the recent Gartner, 451 and IDC reports on microservices — all of which say a service mesh will be mandatory by 2020 for any organization running microservices in production.


Inline yaml Editing with yq

So you're working hard at building a solid Kubernetes cluster — maybe using kops to create a new instance group and BAM you are presented with an editor session to edit the details of that shiny new instance group. No biggie; you just need to add a simple little detailedInstanceMonitoring: true to the spec and you are good to go.

Okay, now you need to do this several times a day to test the performance of the latest build and this is just one of several steps to get the cluster up and running. You want to automate building that cluster as much as possible but every time you get to the step to create that instance group, BAM there it is again — your favorite editor, you have to add that same line every time.

Standard practice is to use cluster templating but there are times when you need something more lightweight. Enter yq.

yq is great for digging through yaml files, but it also has an in-place merge function that can modify a file directly, just like any editor. And kops, along with several other command line tools, honors the EDITOR environment variable, so you can automate your yaml editing along with the rest of your cluster handiwork.

Making it work

The first roadblock is that you can pass command line options via the EDITOR environment variable but the file being edited in-place must be the last option (actually passed to the editor by kops as it invokes the editor). yq wants you to pass it the file to be edited followed by a patch file with instructions on editing the file (more on that below). To get around this issue I use a little bash script to invoke yq and reorder the last two command line options like so (I'll call the file yq-merge-editor.sh):

#!/usr/bin/env bash

if [[ $# != 2 ]]; then
    echo "Usage: $0 <merge file (supplied by script)> <file being edited (supplied by invoker of EDITOR)>"
    exit 1
fi

yq merge --inplace --overwrite "$2" "$1"

In the above script, the merge option tells yq we want to merge yaml files and --inplace says to edit the first file in-place. The --overwrite option instructs yq to overwrite existing sections of the file if they are defined in the merge file. $2 is the file to be edited and $1 is the merge file (the opposite order of what the script gets them in). There are other useful options available documented in the yq merge documentation.

Example 1: Turning on detailed instance monitoring

The next step is to create a patch file containing the edit you want to perform. In this example, we will turn on detailed instance monitoring which is a useful way to get more metrics from your nodes. Here's the merge file (we will call this file ig-monitoring.yaml):

spec:
  detailedInstanceMonitoring: true

To put it all together, you can invoke kops with a custom editor command:

EDITOR="./yq-merge-editor.sh ./ig-monitoring.yaml" kops edit instancegroups nodes

That's it! kops creates a temporary file and invokes your editor script which invokes yq. yq edits the temporary file in-place and kops takes the edited output and moves on.

Example 2: Temporarily add nodes

Say you want to temporarily add capacity to your cluster while performing some maintenance. This is a temporary change, so there's no need to update your cluster's configuration permanently. The following patch file will update the min and max node counts in an instance group:

spec:
  maxSize: 25
  minSize: 25

Then invoke the same script from above followed by a kops update:

EDITOR="./yq-merge-editor.sh ig-nodes-25.yaml" kops edit instancegroups nodes
kops update cluster $NAME --yes

These tips should make it easier to build lots of happy clusters!


Using Kubernetes RBAC to Control Global Configuration In Istio

Why Configuration Matters

If I'm going to get an error for my code, I like to get an error as soon as possible.  Unit test failures are better than integration test failures. I prefer compiler errors to unit test failures - that's what makes TypeScript great.  Going even further, a syntax highlighter is a very proximate feedback of errors - if that keyword doesn't turn green, I can fix the spelling with almost no conscious burden.

Shifting from coding to configuration, I like narrow configuration systems that make it easy to specify a valid configuration and tell me as quickly as possible when I've done something wrong.  In fact, my dream configuration specification wouldn't allow me to specify something invalid. Remember Newspeak from 1984?  A language so narrow that expressing ungood thoughts becomes impossible.  Apparently, I like my programming and configuration languages to be dystopic and Orwellian.

If you can't further narrow the language of your configuration without giving away core functionality, the next best option is to narrow the scope of configuration.  I think about this in reverse - if I am observing some behavior from the system and say to myself, "Huh, that's weird", how much configuration do I have to look at before I understand?  It's great if this is a small and expanding ring - look at config that's very local to this particular object, then the next layer up (in Kubernetes, maybe the rest of the namespace), and so on to what is hopefully a very small set of global config.  Ever tried to debug a program with global variables all over the place? Not fun.

My three principles of ideal config:

  1. Narrow: Like Newspeak, don't allow me to even think of invalid configuration.
  2. Scope: Only let me affect a small set of things associated with my role.
  3. Time: Tell me as early as possible if it's broken.

Declarative config is readily narrow.  The core philosophy of declarative config is saying what you want, not how you get it.  For example, "we'll meet at the Denver Zoo at noon" is declarative. If instead I specify driving directions to the Denver Zoo, I'm taking a much more imperative approach.  What if you want to bike there? What if there is road construction and a detour is required? The value of declarative config is that if we focus on what we want, instead of how to get it, it's easier for me to bring my controller (my car's GPS) and you to bring yours (Google Maps in the Bike setting).

On the other hand, a big part of configuration is pulling a bunch of disparate pieces of information together at the last moment, from a bunch of different roles (think humans, other configuration systems and controllers), just before the system actually starts running.  Some amount of flexibility is required here.

Does Cloud Native Get Us To Better Configuration?

I think a key reason for the popularity of Kubernetes is that it has a great syntax for specifying what a healthy, running microservice looks like.  Its syntax is powerful enough in all the right places to be practical for infrastructure.

Service meshes like Istio robustly connect all the microservices running in your cluster.  They can adaptively route L7 traffic, provide end-to-end mTLS based encryption, and provide circuit breaking and fault injection.  The long feature list is great, but it's not surprising that the result is a somewhat complex set of configuration resources. It's a natural result of the need for powerful syntax to support disparate use cases coupled with rapid development.

Enabling Fine-grained RBAC with Traffic Claim Enforcer

At Aspen Mesh, we found users (including ourselves) spending too much time understanding misconfiguration.  The first way we addressed that problem was with Istio Vet, which is designed to warn you of probably incorrect or incomplete config, and provide guidance to fix it.  Sometimes we know enough that we can prevent the misconfiguration by refusing to allow it in the first place.  For some Istio config resources, we do that using a solution we call Traffic Claim Enforcer.

There are four Istio configuration resources that have global implications: VirtualService, Gateway, ServiceEntry and DestinationRule.  Whenever you create one of these resources, you create it in a particular namespace. They can affect how traffic flows through the service mesh to any target they specify, even if that target isn't in the current namespace.  This surfaces a scope anti-pattern - if I'm observing weird behavior for some service, I have to examine potentially all DestinationRules in the entire Kubernetes cluster to understand why.
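
To make the anti-pattern concrete, here is a hypothetical example: a DestinationRule created in one team's namespace that silently changes traffic policy for a service owned by another team. Stock Istio accepts this, but it is exactly the config you would struggle to find when debugging the other team's service (all names here are made up):

apiVersion: networking.istio.io/v1alpha3
kind: DestinationRule
metadata:
  name: sneaky-policy      # hypothetical
  namespace: team-a
spec:
  host: payments.team-b.svc.cluster.local
  trafficPolicy:
    loadBalancer:
      simple: RANDOM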

That might work in the lab, but we found it to be a serious problem for applications running in production.  Not only is it hard to understand the current config state of the system, it's also easy to break. It’s important to have guardrails that make it so the worst thing I can mess up when deploying my tiny microservice is my tiny microservice.  I don't want the power to mess up anything else, thank you very much. I don't want sudo. My platform lead really doesn't want me to have sudo.

Traffic Claim Enforcer is an admission webhook that waits for a user to configure one of those resources with global implications, and before allowing will check:

  1. Does the resource have a narrow scope that affects only local things?
  2. Is there a TrafficClaim that grants the resource the broader scope requested?

A TrafficClaim is a new Kubernetes custom resource we defined that exists solely to narrow and define the scope of resources in a namespace.  Here are some examples:

kind: TrafficClaim
apiVersion: networking.aspenmesh.io/v1alpha3
metadata:
  name: allow-public
  namespace: cluster-public
claims:
# Anything on www.example.com
- hosts: [ "www.example.com" ]

# Only specific paths on foo.com, bar.com
- hosts: [ "foo.com", "bar.com" ]
  ports: [ 80, 443, 8080 ]
  http:
    paths:
      exact: [ "/admin/login" ]
      prefix: [ "/products" ]

# An external service controlled by ServiceEntries
- hosts: [ "my.external.com" ]
  ports: [ 80, 443, 8080, 8443 ]

TrafficClaims are controlled by Kubernetes Role-Based Access Control (RBAC).  Generally, the same roles or people that create namespaces and set up projects would also create TrafficClaims for those namespaces that need power to define service mesh traffic policy outside of their namespace scope.  Rule 1 about local scope above can be explained as "every namespace has an implied TrafficClaim for namespace-local traffic policy", to avoid requiring a boilerplate TrafficClaim.
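
As a sketch of how that typically looks (names, subjects and the assumed resource plural "trafficclaims" are illustrative), a platform team could use standard Kubernetes RBAC to decide who may manage TrafficClaims in a namespace:

apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  name: trafficclaim-admin
  namespace: cluster-public
rules:
- apiGroups: ["networking.aspenmesh.io"]
  resources: ["trafficclaims"]
  verbs: ["get", "list", "watch", "create", "update", "delete"]
---
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  name: platform-team-trafficclaims
  namespace: cluster-public
subjects:
- kind: Group
  name: platform-admins    # hypothetical group
  apiGroup: rbac.authorization.k8s.io
roleRef:
  kind: Role
  name: trafficclaim-admin
  apiGroup: rbac.authorization.k8s.io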

A pattern we use is to put global config into a namespace like "istio-public" - that's the only place that needs TrafficClaims for things like public DNS names.  Or you might have a couple of namespaces like "istio-public-prod" and "istio-public-dev" or similar. It’s up to you.

Traffic Claim Enforcer does not prevent you from thinking of invalid config, but it does help to limit scope. If I'm trying to understand what happens when traffic goes to my microservice, I no longer have to examine every DestinationRule in the system.  I only have to examine the ones in my namespace, and maybe some others that have special TrafficClaims (and hopefully keep that list small).

Traffic Claim Enforcer also provides an early failure for config problems.  Without it, it is easy to create conflicting DestinationRules even in separate namespaces. This is a problem that Istio-Vet will tell you about, but cannot fix - it doesn't know which one should have priority. If you define TrafficClaims, then Traffic Claim Enforcer can prevent you from configuring it at all.

Hat tip to my colleague, Brian Marshall, who developed the initial public spec for TrafficClaims.  The Istio community is undertaking a great deal of work to scope/manage config aimed at improving system scalability.  We made Traffic Claim Enforcer with a focus on scoping to improve the human config experience as it was a need expressed by several of our users.  We're optimistic that the way Traffic Claim Enforcer helps with human concerns will complement the system scalability side of things.

If you want to give Traffic Claim Enforcer a spin, it's included as part of Aspen Mesh.  By default it doesn't enforce anything, so out-of-the-box it is compatible with Istio. You can turn it on globally or on a namespace-by-namespace basis.

Click play below to check it out in action!

https://youtu.be/47HzynDsD8w

 



Going Beyond Container Orchestration

Every survey of late tells the same story about containers: organizations are not only adopting but embracing the technology. Most aren't relying on containers with the same degree of criticality as hyperscale organizations. That means they are among the 85% of organizations that IDC, in a Cisco-sponsored survey of over 8,000 enterprises, found to be using containers in production. That sounds impressive, but the scale at which they use them is limited. In a Forrester report commissioned by Dell EMC, Intel, and Red Hat, 63% of enterprises using containers have more than 100 instances running, and 82% expect to be doing the same by 2019. That's a far cry from the hundreds of thousands in use by hyperscale technology companies.

And though the adoption rate is high, that's not to say that organizations haven't dabbled with containers only to abandon the effort. As with any (newish) technology, challenges exist. At the top of the list for containers are suspects you know and love: networking and management.

Some of the networking challenges are due to the functionality available in popular container orchestration environments like Kubernetes. Kubernetes supports microservices architectures through its service construct. This allows developers and operators to abstract the functionality of a set of pods and expose it as "a service" with access via a well-defined API. Kubernetes supports naming services as well as performing rudimentary layer 4 (TCP-based) load balancing.

The problem with layer 4 (TCP-based) load balancing is its inability to interact with layer 7 (application and API layers). This is true for any layer 4 load balancing; it's not something unique to containers and Kubernetes. Layer 4 offers visibility into connection level (TCP) protocols and metrics, but nothing more. That makes it difficult (impossible, really) to address higher-order problems such as layer 7 metrics like requests or transactions per second and the ability to split traffic (route requests) based on path. It also means you can't really do rate limiting at the API layer or support key capabilities like retries and circuit breaking.

The lack of these capabilities drives developers to encode them into each microservice instead. That results in operational code being included with business logic. This should cause some amount of discomfort, as it clearly violates the principles of microservice design. It's also expensive as it adds both architectural and technical debt to microservices.

Then there's management. While Kubernetes is especially adept at handling build and deploy challenges for containerized applications, it lacks key functionality needed to monitor and control microservice-based apps at runtime. Basic liveness and health probes don't provide the granularity of metrics or the traceability needed for developers and operators to quickly and efficiently diagnose issues during execution. And getting developers to instrument microservices to generate consistent metrics can be a significant challenge, especially when time constraints are putting pressure on them to deliver customer-driven features.

These are two of the challenges a service mesh directly addresses: management and networking.

How Service Mesh Answers the Challenge

Both are more easily addressed by the implementation of a service mesh as a set of sidecar proxies. By plugging directly into the container environment, sidecar proxies enable transparent networking capabilities and consistent instrumentation. Because all traffic is effectively routed through the sidecar proxy, it can automatically generate and feed the metrics you need to the rest of the mesh. This is incredibly valuable for those organizations that are deploying traditional applications in a container environment. Legacy applications are unlikely to be instrumented for a modern environment. The use of a service mesh and its sidecar proxy basis enable those applications to emit the right metrics without requiring code to be added/modified.

It also means that you don't have to spend your time reconciling different metrics being generated by a variety of runtime agents. You can rely on one source of truth - the service mesh - to generate a consistent set of metrics across all applications and microservices.

Those metrics can include higher order data points that are fed into the mesh and enable more advanced networking to ensure fastest available responses to requests. Retry and circuit breaking is handled by the sidecar proxy in a service mesh, relieving the developer from the burden of introducing operational code into their microservices. Because the sidecar proxy is not constrained to layer 4 (TCP), it can support advanced message routing techniques that rely on access to layer 7 (application and API).
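
As a sketch of how that looks in Istio terms (the service name and thresholds are hypothetical), retries live on a VirtualService and circuit breaking lives on a DestinationRule, so neither has to appear in application code:

apiVersion: networking.istio.io/v1alpha3
kind: VirtualService
metadata:
  name: checkout-retries     # hypothetical
spec:
  hosts:
  - checkout
  http:
  - route:
    - destination:
        host: checkout
    retries:
      attempts: 3
      perTryTimeout: 2s
---
apiVersion: networking.istio.io/v1alpha3
kind: DestinationRule
metadata:
  name: checkout-circuit-breaker
spec:
  host: checkout
  trafficPolicy:
    outlierDetection:
      consecutiveErrors: 5
      interval: 30s
      baseEjectionTime: 60s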

Container orchestration is a good foundation, but enterprise organizations need more than just a good foundation. They need the ability to interact with services at the upper layers of the stack, where metrics and modern architectural patterns are implemented today.

Both are best served by a service mesh. When you need to go beyond container orchestration, go service mesh.