Simplifying Microservices Security with Incremental mTLS

Kubernetes removes much of the complexity and difficulty involved in managing and operating a microservices application architecture. Out of the box, Kubernetes gives you advanced application lifecycle management techniques like rolling upgrades, resiliency via pod replication, auto-scalers and disruption budgets, efficient resource utilization with advanced scheduling strategies and health checks like readiness and liveness probes. Kubernetes also sets up basic networking capabilities which allow you to easily discover new services getting added to your cluster (via DNS) and enables pod to pod communication with basic load balancing.

However, most of the networking capabilities provided by Kubernetes and its CNI providers are constrained to layer 3/4 (networking/protocols like TCP/IP) of the OSI stack. This means that any advanced networking functionality (like retries or routing) which relies on higher layers, i.e. parsing application protocols like HTTP/gRPC (layer 7) or encrypting traffic between pods using TLS (layer 5), has to be baked into the application. Relying on your applications to enforce network security is fraught with landmines: it tightly couples your operations/security and development teams, and it burdens your application developers with owning complicated infrastructure code.

Let’s explore what it takes for applications to perform TLS encryption for all inbound and outbound traffic in a Kubernetes environment. In order to achieve TLS encryption, you need to establish trust between the parties involved in communication. For establishing trust, you need to create and maintain some sort of PKI infrastructure which can generate certificates, revoke them and periodically refresh them. As an operator, you now need a mechanism to provide these certificates (maybe use Kubernetes secrets?) to the running pods and update the pods when new certificates are minted. On the application side, you have to rely on OpenSSL (or its derivatives) to verify trust and encrypt traffic. The application developer team needs to handle upgrading these libraries when CVE fixes and upgrades are released. In addition to all these complexities, compliance concerns may also require you only support a TLS version (or higher) and subset of ciphers, which requires creating and supporting more configuration options in your applications. All of these challenges make it very hard for organizations to encrypt all pod network traffic on Kubernetes, whether it’s for compliance reasons or achieving a zero trust network model.
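To make the "more configuration options" point concrete, here is a minimal sketch of what a single Python service would have to carry just to pin a minimum TLS version, restrict ciphers and require client certificates. The function name and the three file-path parameters are hypothetical, and the cipher string is only one plausible compliance choice.

```python
import ssl


def build_server_tls_context(cert_file=None, key_file=None, ca_file=None):
    """Return a server-side TLS context limited to TLS 1.2+ and a cipher subset.

    The file arguments are hypothetical paths to material issued by your PKI;
    they are optional here so the sketch can be built without any files.
    """
    ctx = ssl.SSLContext(ssl.PROTOCOL_TLS_SERVER)
    # Compliance knob 1: refuse anything older than TLS 1.2.
    ctx.minimum_version = ssl.TLSVersion.TLSv1_2
    # Compliance knob 2: only ECDHE key exchange with AES-GCM for TLS 1.2.
    # (TLS 1.3 cipher suites are managed separately by OpenSSL.)
    ctx.set_ciphers("ECDHE+AESGCM")
    # Require the client to present a certificate (the "mutual" in mTLS).
    ctx.verify_mode = ssl.CERT_REQUIRED
    if ca_file:
        ctx.load_verify_locations(cafile=ca_file)
    if cert_file:
        ctx.load_cert_chain(certfile=cert_file, keyfile=key_file)
    return ctx
```

Every service, in every language, needs an equivalent of this plus the logic to reload certificates when they are re-minted.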

This is the problem that a service mesh leveraging the sidecar proxy approach is designed to solve. The sidecar proxy can initiate a TLS handshake and encrypt traffic without requiring any changes or support from the applications. In this architecture, the application pod makes a request in plain text to another application running in the Kubernetes cluster which the sidecar proxy takes over and transparently upgrades to use mutual TLS. Additionally, the Istio control plane component Citadel handles creating workload identities using the SPIFFE specification to create and renew certificates and mount the appropriate certificates to the sidecars. This removes the burden of encrypting traffic from developers and operators.

Istio provides a rich set of tools to configure mutual TLS globally (on or off) for the entire cluster, or to adopt it incrementally by enabling mTLS for individual namespaces or for a subset of services and their clients. This is where things get a little complicated. In order to correctly configure mTLS for one service, you need to configure an Authentication Policy for that service and the corresponding DestinationRules for its clients.

Both the Authentication policy and the Destination rule follow a complex set of precedence rules which must be accounted for when creating these configuration objects. For example, a namespace level Authentication policy overrides the mesh level global policy, a service level policy overrides the namespace level policy, and a service port level policy overrides the service specific Authentication policy. Destination rules let you specify the client side configuration based on host names: the Destination rule defined in the client namespace has the highest precedence, followed by the one in the server namespace and finally the global default Destination rule. On top of that, if you have conflicting Authentication policies or Destination rules, the system behavior can be indeterminate. A mismatch between the Authentication policy and the Destination rule can lead to subtle traffic failures which are difficult to debug and diagnose. Aspen Mesh makes it easy to understand mTLS status and avoid any configuration errors.
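The precedence rules above can be sketched as a small lookup. This is a simplified model, not Istio's actual resolution code: the scope names and the boolean "mTLS on/off" values are assumptions made for illustration, and it ignores permissive mode and conflicting objects at the same scope.

```python
from typing import Optional

# Authentication policy scopes, most specific first.
POLICY_SCOPES = ["port", "service", "namespace", "mesh"]
# Destination rule lookup order for a client calling a server.
RULE_SCOPES = ["client_namespace", "server_namespace", "global"]


def first_defined(settings: dict, order: list) -> Optional[bool]:
    """Return the value from the highest-precedence scope that defines one."""
    for scope in order:
        if settings.get(scope) is not None:
            return settings[scope]
    return None


def check_mtls(policies: dict, rules: dict) -> str:
    """Compare what the server requires with what the client will send."""
    server_requires = first_defined(policies, POLICY_SCOPES) or False
    client_sends = first_defined(rules, RULE_SCOPES) or False
    if server_requires == client_sends:
        return "ok"
    if server_requires:
        return "mismatch: server requires mTLS, client sends plain text"
    return "mismatch: client sends mTLS, server expects plain text"
```

For example, `check_mtls({"mesh": False, "namespace": True}, {"global": False})` reports a mismatch, because the namespace level policy turns mTLS on for the server while the only Destination rule in scope leaves the client in plain text.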

Editing these complex configuration files in YAML can be tricky and only compounds the problem at hand. In order to simplify how you configure these resources and incrementally adopt mutual TLS in your environment, we are releasing a new feature which enables our customers to specify a service port (via APIs or UI) and their desired mTLS state (enabled or disabled). The Aspen Mesh platform automatically generates the correct set of configurations needed (Authentication policy and/or Destination rules) by inspecting the current state and configuration of your cluster. You can then view the generated YAMLs, edit as needed and store them in your CI system or apply them manually as needed. This feature removes the hassle of learning complex Istio resources and their interaction patterns, and provides you with valid, non-conflicting and functional Istio configuration.
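To give a feel for the pair of objects involved, here is a hedged sketch that builds an Authentication policy and a matching Destination rule for one service port as plain Python dictionaries. It is not the Aspen Mesh generator: it does not inspect cluster state, the function and object names are hypothetical, and the field layout follows our reading of the Istio 1.1-era `authentication.istio.io/v1alpha1` and `networking.istio.io/v1alpha3` APIs, so check it against the Istio version you run.

```python
def mtls_config(service, namespace, port, enabled):
    """Build a (policy, destination_rule) pair for one service port."""
    host = f"{service}.{namespace}.svc.cluster.local"
    name = f"{service}-{port}-mtls"  # hypothetical naming scheme
    policy = {
        "apiVersion": "authentication.istio.io/v1alpha1",
        "kind": "Policy",
        "metadata": {"name": name, "namespace": namespace},
        "spec": {
            # Server side: which service port the policy applies to.
            "targets": [{"name": service, "ports": [{"number": port}]}],
            # An empty peers list means the port accepts plain text.
            "peers": [{"mtls": {}}] if enabled else [],
        },
    }
    rule = {
        "apiVersion": "networking.istio.io/v1alpha3",
        "kind": "DestinationRule",
        "metadata": {"name": name, "namespace": namespace},
        "spec": {
            "host": host,
            "trafficPolicy": {
                # Client side: how sidecars should talk to that port.
                "portLevelSettings": [{
                    "port": {"number": port},
                    "tls": {"mode": "ISTIO_MUTUAL" if enabled else "DISABLE"},
                }],
            },
        },
    }
    return policy, rule
```

The point of generating both objects from one input is that the server side and the client side cannot drift apart.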

The customers we talk to are in various stages of migrating to a microservices architecture or Kubernetes environment. The result is often a hybrid environment in which some services are consumed by clients that are not in the mesh or are deployed outside the Kubernetes environment, so some services require a different mTLS policy. Our hosted dashboard makes it easy for users to identify services and workloads which have mTLS turned on or off and then easily create configuration using the above workflow to change the mTLS state as needed.

If you’re an existing customer, please upgrade your cluster to our latest release (Aspen Mesh 1.1.3-am2) and log in to the dashboard to start using the new capabilities.

If you’re interested in learning about Aspen Mesh and incrementally adopting mTLS in your cluster, you can sign up for a beta account here.


Securing Containerized Applications With Service Mesh

The self-contained, ephemeral nature of microservices comes with some serious upside, but keeping track of every single one is a challenge, especially when trying to figure out how the rest are affected when a single microservice goes down. The end result is that if you’re operating or developing a microservices architecture, there’s a good chance part of your days are spent wondering what your services are up to.

With the adoption of microservices, problems also emerge due to the sheer number of services that exist in large systems. Problems like security, load balancing, monitoring and rate limiting that had to be solved once for a monolith, now have to be handled separately for each service.

The technology aimed at addressing these microservice challenges has been rapidly evolving:

  1. Containers facilitate the shift from monolith to microservices by enabling independence between applications and infrastructure.
  2. Container orchestration tools solve microservices build and deploy issues, but leave many unsolved runtime challenges.
  3. Service mesh addresses runtime issues including service discovery, load balancing, routing and observability.

Securing services with a service mesh

A service mesh provides an advanced toolbox that lets users add security, stability and resiliency to containerized applications. One of the more common applications of a service mesh is bolstering cluster security. There are 3 distinct capabilities provided by the mesh that enable platform owners to create a more secure architecture.

Traffic Encryption  

As a platform operator, I need to provide encryption between services in the mesh. I want to leverage mTLS to encrypt traffic between services. I want the mesh to automatically encrypt and decrypt requests and responses, so I can remove that burden from my application developers. I also want it to improve performance by prioritizing the reuse of existing connections, reducing the need for the computationally expensive creation of new ones. I also want to be able to understand and enforce how services are communicating and prove it cryptographically.

Security at the Edge

As a platform operator, I want Aspen Mesh to add a layer of security at the perimeter of my clusters so I can monitor and address compromising traffic as it enters the mesh. I can use the built in power of Kubernetes as an ingress controller to add security with ingress rules such as whitelisting and blacklisting. I can also apply service mesh route rules to manage compromising traffic at the edge. I also want control over egress so I can dictate that our network traffic does not go places it shouldn't (blacklist by default and only talk to what you whitelist).
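The "blacklist by default" egress stance boils down to a default-deny check. A minimal sketch, assuming a hypothetical allowlist of external hostnames (a real mesh enforces this in the proxy, not in application code):

```python
# Hypothetical external hosts this cluster is allowed to call.
ALLOWED_EGRESS_HOSTS = {"api.payments.example.com", "storage.example.com"}


def egress_allowed(host, allowlist=ALLOWED_EGRESS_HOSTS):
    """Default deny: only hosts explicitly on the allowlist may be reached."""
    # Normalize case and a trailing dot so "Example.com." matches "example.com".
    return host.lower().rstrip(".") in allowlist
```

Anything not listed, including hosts nobody thought about, is refused without needing a rule of its own.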

Role Based Access Control (RBAC)

As the platform operator, it’s important that I am able to provide the level of least privilege so the developers on my platform only have access to what they need, and nothing more. I want to enable controls so app developers can write policy for their apps and only their apps so that they can move quickly without impacting other teams. I want to use the same RBAC framework that I am familiar with to provide fine-grained RBAC within my service mesh.

How a service mesh adds security

You’re probably thinking to yourself, traffic encryption and fine-grained RBAC sound great, but how does a service mesh actually get me to them? Service meshes that leverage a sidecar approach are uniquely positioned to intercept and encrypt data. A sidecar proxy is a prime insertion point to ensure that every service in a cluster is secured and monitored in real time. Let’s explore some details around why sidecars are a great place for security.

Sidecar is a great place for security

Securing applications and infrastructure has always been daunting, in part because the adage really is true: you are only as secure as your weakest link.  Microservices are an opportunity to improve your security posture but can also cut the other way, presenting challenges around consistency.  For example, the best organizations use the principle of least privilege: an app should only have the minimum amount of permissions and privilege it needs to get its job done.  That's easier to apply where a small, single-purpose microservice has clear and narrowly-scoped API contracts.  But there's a risk that as application count increases (lots of smaller apps), this principle can be unevenly applied. Microservices, when managed properly, increase feature velocity and enable security teams to fulfill their charter without becoming the Department of No.

There's tension: Move fast, but don't let security coverage slip through the cracks.  Prefer many smaller things to one big monolith, but secure each and every one.  Let each team pick the language of their choice, but protect them with a consistent security policy.  Encourage app teams to debug, observe and maintain their own apps but encrypt all service-to-service communication.

A sidecar is a great way to balance these tensions with an architecturally sound security posture.  Sidecar-based service meshes like Istio and Linkerd 2.0 put their datapath functionality into a separate container and then situate that container as close to the application they are protecting as possible.  In Kubernetes, the sidecar container and the application container live in the same Kubernetes Pod, so the communication path between sidecar and app is protected inside the pod's network namespace; by default it isn't visible to the host or other network namespaces on the system.  The app, the sidecar and the operating system kernel are involved in communication over this path.  Compared to putting the security functionality in a library, using a sidecar adds the surface area of kernel loopback networking inside of a namespace, instead of just kernel memory management.  This is additional surface area, but not much.

The major drawbacks of library approaches are consistency and sprawl in polyglot environments.  If you have a few different languages or application frameworks and take the library approach, you have to secure each one.  This is not impossible, but it's a lot of work.  For each different language or framework, you get or choose a TLS implementation (perhaps choosing between OpenSSL and BoringSSL).  You need a configuration layer to load certificates and keys from somewhere and safely pass them down to the TLS implementation.  You need to reload these certs and rotate them.  You need to evaluate "information leakage" paths: does your config parser log errors in plaintext (so it by default might print the TLS key to the logs)?  Is it OK for app core dumps to contain these keys?  How often does your organization require re-keying on a connection?  By bytes or time or both?  Minimum cipher strength?  When a CVE in OpenSSL comes out, what apps are using that version and need updating?  Who on each app team is responsible for updating OpenSSL, and how quickly can they do it?  How many apps have a certificate chain built into them for consuming public websites even if they are internal-only?  How many Dockerfiles will you need to update the next time a public signing authority has to revoke one?  slowloris?

Your organization can do all this work.  In fact, parts probably already have - above is our list of painful app security experiences but you probably have your own additions.  It is a lot of cross-organizational effort and process to get it right.  And you have to get it right everywhere, or your weakest link will be exploited.  Now with microservices, you have even more places to get it right.  Instead, our advice is to focus on getting it right once in the sidecar, and then distributing the sidecar everywhere, and get back to adding business value instead of duplicating effort.

There are some interesting developments on the horizon like the use of kernel TLS to defer bulk and some asymmetric crypto operations to the kernel.  That's great:  Implementations should change and evolve.  The first step is providing a good abstraction so that apps can delegate to lower layers. Once that's solid, it's straightforward to move functionality from one layer to the next as needed by use case, because you don't perturb the app any more.  As precedent, consider TCP Segmentation Offload, which lets the network card manage splitting app data into the correct size for each individual packet.  This task isn't impossible for an app to do, but it turns out to be wasted effort.  Once TCP segmentation was deferred to the kernel, it left the realm of the app.  Then, kernels, network drivers, and network cards were free to focus on the interoperability and semantics required to perform TCP segmentation at the right place.  That's our position for this higher-level service-to-service communication security: move it outside of the app to the sidecar, and then let sidecars, platforms, kernels and networking hardware iterate.

Envoy Is a Great Sidecar

We use Envoy as our sidecar because it's lightweight, has some great features and good API-based configurability.  Here are some of our favorite parts about Envoy:

  • Configurable TLS Parameters: Envoy exposes all the TLS configuration points you'd expect (cipher strength, protocol versions, curves).  The advantage to using Envoy is that they're configured the same way for every app using the sidecar.
  • Mutual TLS: Typically TLS is used to authenticate the server to the client, and to encrypt communication.  What's missing is authenticating the client to the server - if you do this, then the server knows what is talking to it.  Envoy supports this bi-directional authentication out of the box, which can easily be incorporated into a SPIFFE system.  In today's complex cloud datacenters, you're better off if you trust things based on cryptographic proof of what they are, instead of network perimeter protection of where they called from.
  • BoringSSL: This fork of OpenSSL removed huge amounts of code like implementations of obsolete ciphers and cleaned up lots of vestigial implementation details that had repeatedly been the source of security vulnerabilities.  It's a good default choice if you don't need any OpenSSL-specific functionality because it's easier to get right.
  • Security Audit: A security audit can't prove the absence of vulnerabilities but it can catch mistakes that demonstrate either architectural weaknesses or implementation sloppiness.  Envoy's security audit did find issues but in our opinion indicated a high level of security health.
  • Fuzzed and Bountied: Envoy is continuously fuzzed (exposed to malformed input to see if it crashes) and covered by Google's Patch Reward security bug bounty program.
  • Good API Granularity: API-based configuration doesn't mean "just serialize/deserialize your internal state and go."  Careful APIs thoughtfully map to the "personas" of what's operating them (even if those personas are other programs).  Envoy's xDS APIs in our experience partition routing behavior from cluster membership from secrets.  This makes it easy to make well-partitioned controllers.  A knock-on benefit is that it is easy in our experience to debug and test Envoy because config constructs usually map pretty clearly to code constructs.
  • No garbage collector: There are great languages with automatic memory management like Go that we use every day.  But we find languages like C++ and Rust provide predictable and optimizable tail latency.
  • Native Extensibility via Filters: Envoy has layer 4 and layer 7 extension points via filters that are written in C++ and linked into Envoy.
  • Scripting Extensibility via Lua: You can write Lua scripts as extension points as well.  This is very convenient for rapid prototyping and debugging.

One of these benefits deserves an even deeper dive in a security-oriented discussion.  The API granularity of Envoy is based on a scheme called "xDS" which we think of as follows:  Logically split the Envoy config API based on the user of that API.  The user in this case is almost always some other program (not a human), for instance a Service Mesh control plane element.

For instance, in xDS listeners ("How should I get requests from users?") are separated from clusters ("What pods or servers are available to handle requests to the shoppingcart service?").  The "x" in "xDS" is replaced with whatever functionality is implemented ("LDS" for listener discovery service).  Our favorite security-related partitioning is that the Secret Discovery Service can be used for propagating secrets to the sidecars independent of the other xDS APIs.

Because SDS is separate, the control plane can implement the Principle of Least Privilege: nothing outside of SDS needs to handle or have access to any private key material.
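As a toy illustration of that partitioning (not Envoy's or any control plane's actual code), imagine each discovery service being handed only its own slice of the configuration. The service-to-resource mapping uses the usual xDS names; the sample config values are made up.

```python
# Which kind of resource each discovery service is responsible for.
XDS_SERVICES = {
    "LDS": "listeners",
    "RDS": "routes",
    "CDS": "clusters",
    "EDS": "endpoints",
    "SDS": "secrets",
}

# Made-up mesh configuration; only "secrets" holds key material.
CONFIG = {
    "listeners": [{"name": "inbound", "port": 8080}],
    "routes": [{"name": "default", "cluster": "shoppingcart"}],
    "clusters": [{"name": "shoppingcart"}],
    "endpoints": [{"cluster": "shoppingcart", "address": "10.0.0.7"}],
    "secrets": [{"name": "default", "private_key": "<redacted>"}],
}


def view_for(service):
    """Return only the config slice that one discovery service may read."""
    resource = XDS_SERVICES[service]
    return {resource: CONFIG[resource]}
```

With this shape, `view_for("LDS")` contains no `secrets` key at all, so a bug or compromise in the listener controller cannot leak private keys it was never given.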

Mutual TLS is a great enhancement to your security posture in a microservices environment.  We see mutual TLS adoption as gradual - almost any real-world app will have some containerized microservices ready to join the service mesh and mTLS on day one.  But practically speaking, many of these will depend on mesh-external services, containerized or not.  It is possible in most cases to integrate these services into the same trust domain as the service mesh, and oftentimes these components can even participate in client TLS authentication so you get true mutual TLS.

In our experience, this happens by gradually expanding the "circle" of things protected with mutual TLS.  First, stateless containerized business logic, next in-cluster third party services, finally external state stores like bare metal databases.  That's why we focus on making the state of mTLS easy to understand in Aspen Mesh, and provide assistants to help you detect configuration mishaps.

What lives outside the sidecar?

You need a control plane to configure all of these sidecars.  In some simple cases it may be tempting to do this with some CI integration to generate configs plus DNS-based discovery.  This is viable but it's hard to do rapid certificate rotation.  Also, it leaves out more dynamic techniques like canaries, progressive delivery and A/B testing.  For this reason, we think most real-world applications will include an online control plane that should:

  • Disseminate configuration to each of the sidecars with a scalable approach.
  • Rotate sidecar certificates rapidly to reduce the value to an attacker of a one-time exploit of an application.
  • Collect metadata on what is communicating with what.
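Rapid rotation usually means renewing a certificate well before it expires. A minimal sketch of one possible rule, renewing once a fixed fraction of the certificate's lifetime has elapsed; the function name and the default fraction of one half are assumptions, not a description of any particular control plane:

```python
from datetime import datetime, timedelta, timezone


def needs_rotation(not_before, not_after, now, fraction=0.5):
    """True once `fraction` of the certificate's validity window has passed."""
    lifetime = not_after - not_before
    return now >= not_before + lifetime * fraction


# Usage: a 24-hour certificate is due for renewal after 12 hours.
issued = datetime(2019, 1, 1, tzinfo=timezone.utc)
expires = issued + timedelta(hours=24)
due = needs_rotation(issued, expires, issued + timedelta(hours=12))
```

Shorter lifetimes combined with early renewal shrink the window in which a stolen key is useful.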

A good security posture means you should be automating some work on top of the control plane. We think these things are important (and built them into Aspen Mesh):

  • Organizing information to help humans narrow in on problems quickly.
  • Warning on potential misconfigurations.
  • Alerting when unhealthy communication is observed.
  • Inspecting the firehose of metadata for surprises - these patterns could be application bugs or security issues or both.
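One simple way to look for surprises is to compare the service-to-service calls the sidecars report against the calls you expect. This is a rough sketch with hypothetical service names; a real system would work from richer telemetry than bare (source, destination) pairs.

```python
from collections import Counter


def unexpected_edges(observed, expected):
    """Count observed (source, destination) calls that are not in `expected`."""
    counts = Counter(observed)
    return {edge: n for edge, n in counts.items() if edge not in expected}


# Hypothetical traffic reported by the sidecars.
observed = [
    ("frontend", "cart"),
    ("frontend", "cart"),
    ("cart", "db"),
    ("frontend", "db"),  # frontend is not supposed to reach the db directly
]
expected = {("frontend", "cart"), ("cart", "db")}
surprises = unexpected_edges(observed, expected)
```

Here `surprises` holds the single `("frontend", "db")` edge, which could be an application bug, a security issue, or both.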

If you’re considering or going down the Kubernetes path, you should be thinking about the unique security challenges that come with microservices running in a Kubernetes cluster. Kubernetes solves many of these, but there are some critical runtime issues that a service mesh can make easier and more secure. If you would like to talk about how the Aspen Mesh platform and team can address your specific security challenge, feel free to find some time to chat with us.


The Complete Guide to Service Mesh

What’s Going On In The Service Mesh Universe?

Service meshes are relatively new, extremely powerful and can be complex. There’s a lot of information out there on what a service mesh is and what it can do, but it’s a lot to sort through. Sometimes, it’s helpful to have a guide. If you’ve been asking questions like “What is a service mesh?” “Why would I use one?” “What benefits can it provide?” or “How did people even come up with the idea for service mesh?” then The Complete Guide to Service Mesh is for you.

Check out the free guide to find out:

  • The service mesh origin story
  • What a service mesh is
  • Why developers and operators love service mesh
  • How a service mesh enables DevOps
  • Problems a service mesh solves

The Landscape Right Now

A service mesh overlaps, complements, and in some cases, replaces many tools that are commonly used to manage microservices. Last year was all about evaluating and trying out service meshes. But while curiosity about service mesh is still at a peak, enterprises are already in the evaluation and adoption process.

The capabilities service mesh can add to ease managing microservices applications at runtime are clearly exciting to early adopters and companies evaluating service mesh. Conversations tell us that many enterprises are already using microservices and service mesh, and many others are planning to deploy in the next six months. And if you’re not yet sure about whether or not you need a service mesh, check out the recent Gartner, 451 and IDC reports on microservices — all of which say a service mesh will be mandatory by 2020 for any organization running microservices in production.

Get Started with Service Mesh

Are you already using Kubernetes and Istio? You might be ready to get started using a service mesh. Download Aspen Mesh here or contact us to talk with a service mesh expert about getting set up for success.

Get the Guide

Fill out the form below to get your copy of The Complete Guide to Service Mesh.