Understanding Service Mesh

The Origin of Service Mesh

In the beginning, we had packets and packet-switched networks.

Everyone on the Internet — all 30 of them — built addressing and session establishment/teardown on top of packets. Then they'd need a retransmission scheme. Then they'd build an ordered byte stream out of it.

Eventually, they realized they had all built the same thing. The RFCs for IP and TCP standardized this, operating systems provided a TCP/IP stack, so no application ever had to turn a best-effort packet network into a reliable byte stream.

We took our reliable byte streams and used them to make applications. Turns out that a lot of those applications had common patterns again — they requested things from servers, and then got responses. So, we separated these request/responses into metadata (headers) and body.

HTTP standardized the most widely deployed request/response protocol. Same story. App developers don't have to implement the mechanics of requests and responses. They can focus on the app on top.

There's a newer set of functionality that you need to build a reliable microservices application: service discovery, versioning, zero trust... all the stuff popularized by the Netflix architecture, by 12-factor apps, etc. We see the same thing happening again - an emerging set of best practices that you have to build into each microservice to be successful.

So, a service mesh is about putting all that functionality into a layer underneath your code, just like HTTP, TCP and packets, but one that creates a network for services rather than bytes.

Questions? Download The Complete Guide to Service Mesh or keep reading to find out more about what exactly a service mesh is.

What Is A Service Mesh?

A service mesh is a transparent infrastructure layer that sits between your network and application.

It’s designed to handle a high volume of service-to-service communications using application programming interfaces (APIs). A service mesh ensures that communication among containerized application services is fast, reliable and secure.

The mesh provides critical capabilities including service discovery, load balancing, encryption, observability, traceability, authentication and authorization, and the ability to control policy and configuration in your Kubernetes clusters.
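In an Istio-based mesh, for example, that traffic-control capability is expressed declaratively as Kubernetes custom resources. Here's a minimal sketch of a VirtualService (the service name and subsets are hypothetical) that shifts 10% of traffic to a new version:

```yaml
apiVersion: networking.istio.io/v1alpha3
kind: VirtualService
metadata:
  name: reviews
  namespace: default
spec:
  hosts:
  - reviews              # the in-mesh service this rule applies to
  http:
  - route:
    - destination:
        host: reviews
        subset: v1       # subsets are defined in a DestinationRule
      weight: 90
    - destination:
        host: reviews
        subset: v2       # canary version receives 10% of traffic
      weight: 10
```

Because the mesh sits underneath the application, a shift like this happens without any change to service code.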

A service mesh helps address many of the challenges that arise once your application is in the hands of end users. Being able to monitor which services are communicating with each other, whether those communications are secure, and being able to control the service-to-service communication in your clusters are key to ensuring applications run securely and resiliently.

More Efficiently Managing Microservices

The self-contained, ephemeral nature of microservices comes with some serious upside, but keeping track of every single one is a challenge — especially when trying to figure out how the rest are affected when a single microservice goes down. The end result is that if you’re operating or developing in a microservices architecture, there’s a good chance part of your day is spent wondering what the hell your services are up to.

With the adoption of microservices, problems also emerge due to the sheer number of services that exist in large systems. Problems like security, load balancing, monitoring and rate limiting that had to be solved once for a monolith now have to be handled separately for each service.

Service mesh helps address many of these challenges so engineering teams, and businesses, can deliver applications more quickly and securely.

Why You Might Care

If you’re reading this, you’re probably responsible for making sure that you and your end users get the most out of your applications and services. In order to do that, you need to have the right kind of access, security and support. That’s probably why you started down the microservices path.

If that’s true, then you’ve probably realized that microservices come with their own unique challenges, such as:

  1. An increased attack surface
  2. Polyglot challenges
  3. Controlling access for distributed teams developing on a single application

That’s where a service mesh comes in.

A service mesh is an infrastructure layer for microservices applications that can help reduce the complexity of managing microservices and deployments by handling infrastructure service communication quickly, securely and reliably. 

Service meshes are great at solving operational challenges and issues when running containers and microservices because they provide a uniform way to secure, connect and monitor microservices. 

Here’s the point: a good service mesh keeps your company’s services running the way they should. A service mesh designed for the enterprise, like Aspen Mesh, gives you all the observability, security and traffic management you need — plus access to engineering and support, so you can focus on adding the most value to your business.

And that is good news for DevOps.

The Rise of DevOps - and How Service Mesh Is Enabling It

It’s happening, and it’s happening fast.

Companies are transforming their internal organizations and product architectures along a new axis of performance. They’re finding more value in iteration, efficiency and incremental scaling, which is pushing them to adopt DevOps methodologies. This focus on time-to-market is driving some of the most cutting-edge infrastructure technology we have ever seen. Technologies like containers and Kubernetes, along with a focus on stable, consistent and open APIs, allow small teams to make amazing progress and move at the speed they require, reducing friction and time to market.

The adoption of these technologies isn’t perfect, and as companies deploy them at scale, they realize that they have inadvertently increased complexity and de-centralized ownership and control. In many cases, it’s challenging to understand the entire system.

A service mesh enables DevOps teams by helping manage this complexity. It provides autonomy and freedom for development teams through a stable and scalable platform, while simultaneously providing a way for platform teams to enforce security, policy and compliance standards.

This empowers your development teams to make choices based on the problems they are solving rather than being concerned with the underlying infrastructure. Dev teams now have the freedom to deploy code without the fear of violating compliance or regulatory guidelines, and platform teams can put guardrails in place to ensure your applications are secure and resilient.

Want to learn more? Get the Complete Guide to Service Mesh here.

Important Security Updates in Aspen Mesh 1.1.13

Aspen Mesh is announcing the release of 1.1.13, which addresses important Istio security updates. Below are the details of the security fixes, taken from the Istio 1.1.13 security update.

ISTIO-SECURITY-2019-003: An Envoy user publicly reported an issue (cf. Envoy Issue 7728) with regular expression matching that crashes Envoy on very large URIs.

  • CVE-2019-14993: After investigation, the Istio team has found that this issue could be leveraged for a DoS attack in Istio if users are employing regular expressions in some of the Istio APIs: JWT, VirtualService, HTTPAPISpecBinding, QuotaSpecBinding.

ISTIO-SECURITY-2019-004: Envoy, and subsequently Istio, are vulnerable to a series of trivial HTTP/2-based DoS attacks:

  • CVE-2019-9512: HTTP/2 flood using PING frames and queuing of response PING ACK frames that results in unbounded memory growth (which can lead to out of memory conditions).
  • CVE-2019-9513: HTTP/2 flood using PRIORITY frames that results in excessive CPU usage and starvation of other clients.
  • CVE-2019-9514: HTTP/2 flood using HEADERS frames with invalid HTTP headers and queuing of response RST_STREAM frames that results in unbounded memory growth (which can lead to out of memory conditions).
  • CVE-2019-9515: HTTP/2 flood using SETTINGS frames and queuing of SETTINGS ACK frames that results in unbounded memory growth (which can lead to out of memory conditions).
  • CVE-2019-9518: HTTP/2 flood using frames with an empty payload that results in excessive CPU usage and starvation of other clients.
  • See this security bulletin for more information

The 1.1.13 binaries are available for download here.

For upgrade procedures for Aspen Mesh deployments installed via Helm (helm install), please visit our Getting Started page.


Why Is Policy Hard?

Aspen Mesh spends a lot of time talking to users about policy, even if we don’t always start out calling it that. A common pattern we see with clients is:

  1. Concept: "Maybe I want this service mesh thing"
  2. Install: "Ok, I've got Aspen Mesh installed, now what?"
  3. Observe: "Ahhh! Now I see how my microservices are communicating.  Hmmm, what's that? That pod shouldn't be talking to that database!"
  4. Act: "Hey mesh, make sure that pod never talks to that database"
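At the lowest layer, that "Act" step might compile down to something like this Kubernetes NetworkPolicy sketch (all names and labels are hypothetical), which only admits database traffic from pods the platform team has explicitly labeled:

```yaml
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: lock-down-staging-db
  namespace: staging
spec:
  podSelector:
    matchLabels:
      app: staging-db          # the database pods to protect
  policyTypes:
  - Ingress
  ingress:
  - from:
    - podSelector:
        matchLabels:
          db-access: "true"    # only pods explicitly granted access
```

A mesh can express the same intent at layer 7 instead, using service identities rather than pod labels.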

The Act phase is interesting, and there’s more to it than might be obvious at first glance. In this blog, I'll propose some thought experiments to delve into how a service mesh can help you act on insights from the mesh.

First, put yourself in the shoes of the developer that just found out their test pod is accidentally talking to the staging database. (Ok, you're working from home today so you don't have to put on shoes; the cat likes sleeping on your shoeless feet better anyways.) You want to control the behavior of a narrow set of software for which you're the expert; you have local scope and focus.

Next, put on the shoes of a person responsible for operating many applications; people we talk to often have titles that include Platform, SRE, Ops, Infra. Each day they’re diving into different applications so being able to rapidly understand applications is key. A consistent way of mapping across applications, datacenters, clouds, etc. is critical. Your goal is to reduce "snowflake architecture" in favor of familiarity to make it easier when you do have to context switch.

Now let's change into the shoes of your org's Compliance Officer. You’re on the line for documenting and proving that your platform is continually meeting compliance standards. You don't want to be the head of the “Department of No”, but what’s most important to you is staying out of the headlines. A great day at work for you is when you've got clarity on what's going on across lots of apps, databases, external partners, every source of data your org touches AND you can make educated tradeoffs to help the business move fast with the right risk profile. You know it’s ridiculous to be involved in every app change, so you need separation-of-concerns.

I'd argue that all of these people have policy concerns. They want to be able to specify their goals at a suitably high level and leave the rote and repetitive portions to an automated system. The challenging part is that there's only one underlying system ("the Kubernetes cluster") that has to respond to each of these disparate personas.

So, to me policy is about transforming a bunch of high-level behavioral prescriptions into much lower-level versions through progressive stages. Useful real-world policy systems do this in a way that is transparent and understandable to all users, and minimizes the time humans spend coordinating. Here's an example "day-in-the-life" of a policy:

At the top is the highest-level goal: "Devs should test new code without fear". A computer can't implement a goal that abstract. At the bottom is a rule suitable for a computer, like a firewall, to implement.

The layers in the middle are where a bad policy framework can really hurt. Some personas (the hypothetical Devs) want to instantly jump to the bottom layer. Other personas (the hypothetical Compliance Officer) are way up top, going down a few layers but not getting to the bottom on a day-to-day basis.

I think the best policy frameworks help each persona:

  • Quickly find the details for the layer they care about right now.
  • Understand "Where did this come from?" (connect to higher layers).
  • Understand "Is this doing what I want?" (trace to lower layers).
  • Know "Where do I go to change this?" (edit/create policy).

As an example, let's look at iptables, one of the firewalling/packet mangling frameworks for Linux.  This is at that bottom layer in my example stack - very low-level packet processing that I might look at if I'm an app developer and my app's traffic isn't doing what I'd expect.  Here's an example dump:

root@kafka-0:/# iptables -n -L -v --line-numbers -t nat
Chain PREROUTING (policy ACCEPT 594K packets, 36M bytes)
num   pkts bytes target     prot opt in out   source destination
1     594K 36M ISTIO_INBOUND  tcp -- * *  

Chain INPUT (policy ACCEPT 594K packets, 36M bytes)
num   pkts bytes target     prot opt in out   source destination

Chain OUTPUT (policy ACCEPT 125K packets, 7724K bytes)
num   pkts bytes target     prot opt in out   source destination
1      12M 715M ISTIO_OUTPUT  tcp -- * *  

Chain POSTROUTING (policy ACCEPT 12M packets, 715M bytes)
num   pkts bytes target     prot opt in out   source destination

Chain ISTIO_INBOUND (1 references)
num   pkts bytes target     prot opt in out   source destination
1        0 0 RETURN     tcp -- * *            tcp dpt:22
2     594K 36M RETURN     tcp -- * *            tcp dpt:15020
3        2 120 ISTIO_IN_REDIRECT  tcp -- * *  

Chain ISTIO_IN_REDIRECT (1 references)
num   pkts bytes target     prot opt in out   source destination
1        2 120 REDIRECT   tcp -- * *            redir ports 15006

Chain ISTIO_OUTPUT (1 references)
num   pkts bytes target     prot opt in out   source destination
1      12M 708M ISTIO_REDIRECT  all -- * lo           !
2        7 420 RETURN     all -- * *            owner UID match 1337
3        0 0 RETURN     all -- * *            owner GID match 1337
4     119K 7122K RETURN     all -- * *
5        4 240 ISTIO_REDIRECT  all -- * *  

Chain ISTIO_REDIRECT (2 references)
num   pkts bytes target     prot opt in out   source destination
1      12M 708M REDIRECT   tcp -- * *            redir ports 15001

This allows me to quickly understand a lot of details about what is happening at this layer. Each rule specification is on the right-hand side and is relatively intelligible to the personas that operate at this layer. On the left, I get "pkts" and "bytes" - this is a count of how many packets have triggered each rule, helping me answer "Is this doing what I want it to?". There's even more information here if I'm really struggling: I can log the individual packets that are triggering a rule, or mark them in a way that I can capture them with tcpdump.  

Finally, furthest on the left in the "num" column is a line number, which is necessary if I want to modify or delete rules or add new ones before/after; this is a little bit of help for "Where do I go to change this?". I say a little bit because in most systems that I'm familiar with, including the one I grabbed that dump from, iptables rules are produced by some program or higher-level system; they're not written by a human. So if I just added a rule, it would only apply until that higher-level system intervened and changed the rules (in my case, until a new Pod was created, which can happen at any time). I need help navigating up a few layers to find the right place to effect the change.

iptables lets you organize groups of rules into your own chains; in this case, the chain names (ISTIO_*) are a hint that Istio produced these rules, so I know which higher layer to examine.

For a much different example, how about the Kubernetes CI Robot (from Prow)? If you've ever made a PR to Kubernetes or many other CNCF projects, you likely interacted with this robot. It's an implementer of policy; in this case, the policies around changing Kubernetes source code. One of the policies it manages is compliance with the Contributor License Agreement: contributors agree to grant some intellectual property rights surrounding their contributions. If k8s-ci-robot can't confirm that everything is in order, it will add a comment to your PR.

This is much different from firewall policy, but I'd say it's still policy, and I think the same principles apply. Let's explore. If you had to diagram the policy around this, it would start at the top with the legal principle that Kubernetes wants to make sure all the software under its umbrella has free and clear IP terms. Stepping down a layer, the Kubernetes project decided to satisfy that requirement by requiring a CLA for any contributions. And so on, until we get to the bottom layer: the code that implements the CLA check.

As an aside, the code that implements the CLA check is actually split into two halves: first there's a CI job that actually checks the commits in the PR against a database of signed CLAs, and then there's code that takes the result of that job and posts helpful information for users to resolve any issues. That's not visible or important at that top layer of abstraction (the CNCF lawyers shouldn't care).

This policy structure is easy to navigate. If your CLA check fails, the comment from the robot has great links. If you're an individual contributor you can likely skip up a layer, sign the CLA and move on. If you're contributing on behalf of a company, the links will take you to the document you need to send to your company's lawyers so they can sign on behalf of the company.

So those are two examples of policy. You probably encounter many other ones every day from corporate travel policy to policies (written, unwritten or communicated via email missives) around dirty dishes.

It's easy to focus on the technical capabilities of the lowest levels of your system, but I'd recommend that you don't lose focus on its operability. It’s important that it be transparent and easy to understand. Both iptables and the k8s-ci-robot are transparent. The k8s-ci-robot has an additional feature: it knows you're probably wondering "Where did this come from?" and answers that question for you. This helps you and your organization navigate the layers of policy.

When implementing service mesh to add observability, resilience and security to your Kubernetes clusters, it’s important to consider how to set up policy in a way that can be navigated by your entire team. With that end in mind, Aspen Mesh is building a policy framework for Istio that makes it easy to implement policy and understand how it will affect application behavior.

Did you like this blog? Subscribe to get email updates when new Aspen Mesh blogs go live.

Microservice Security and Compliance in Highly Regulated Industries: Zero Trust Security

Zero Trust Security

Security is the most critical part of your application to get right. Failing to secure your users’ data can be very expensive and can make customers lose faith in your ability to protect their sensitive information. A recent IBM-sponsored study showed that the average cost of a data breach is $3.92 million, with healthcare the most expensive industry at an average of $6.45 million per breach. What may be more surprising: the average time to identify and contain a breach is 279 days, while the average lifecycle of a malicious attack, from breach to containment, is 314 days.

Traditionally, network security has been based on having a strong perimeter to help thwart attackers, commonly known as the moat-and-castle approach. This approach is no longer effective in a world where employees expect access to applications and data from anywhere in the world, on any device. This shift is forcing organizations to evolve at a rapid rate to stay competitive, and has left many engineering teams scrambling to keep up with employee expectations. Often this means rearchitecting systems and services, which is difficult, time-consuming, and error-prone.

In 2010, Forrester coined the term ‘Zero Trust’, flipping the existing security models on their heads by changing how we think about cyberthreats. The new model assumes you’ve been compromised but may not yet be aware of it. A couple of years later, Google announced it had implemented Zero Trust in its networking infrastructure with much success. Fast forward to 2019, and plans to adopt this new paradigm have spread across industries like wildfire, driven largely by massive data breaches and stricter regulatory requirements.

Here are the key Zero Trust Networking Principles:

  • Networks should always be considered hostile. 
    • Just because you’re inside the “castle” does not make you safe.
  • Network locality is not sufficient for deciding trust in a network.
    • Just because you know someone next to you in the “castle”, doesn’t mean you should trust them.
  • Every device, user, and request is authenticated and authorized.
    • Ensure that every person entering the “castle” has been properly identified and is allowed to enter.
  • Network policies must be dynamic and calculated from as many sources of data as possible. 
    • Ask as many people as possible when validating if someone is allowed to enter the “castle”.

Transitioning to Zero Trust Networking can dramatically increase your security posture, but until recent years it was a time-consuming and difficult task that required extensive security knowledge within engineering teams, sophisticated internal tooling to manage workload certificates, and service-level authentication and authorization. Thankfully, service mesh technologies such as Istio let us implement Zero Trust Networking across our microservices and clusters with little effort and minimal service disruption, without requiring your team to be security experts.

Zero Trust Networking With Istio

Istio provides the following features that help us implement Zero Trust Networking in our infrastructure:

  • Service Identities
  • Mutual Transport Layer Security (mTLS)
  • Role Based Access Control (RBAC) 
  • Network Policy

Service Identities

One of the key Zero Trust Networking principles requires that “every device, user, and request is authenticated and authorized”. Istio implements this foundational principle by issuing secure identities to services, much as application users are issued an identity. This identity is often referred to as an SVID (SPIFFE Verifiable Identity Document) and is used to identify services across the mesh so they can be authenticated and authorized to perform actions. Service identities can take different forms depending on the platform Istio is deployed on, for example:

  • Kubernetes: Istio can use Kubernetes service accounts.
  • Amazon Web Services (AWS): Istio can use AWS IAM user and role accounts.
  • Google Kubernetes Engine (GKE): Istio can use Google Cloud Platform (GCP) service accounts.
Mutual Transport Layer Security (mTLS)

To support secure service identities and to protect data in transit, Istio provides mTLS for encrypting service-to-service communication and achieving non-repudiation for requests. This layer of security reduces the likelihood of a successful man-in-the-middle (MITM) attack by requiring all parties in a request to have valid certificates that trust each other. Certificate generation, distribution, and rotation are handled automatically by a secure Istio service called Citadel.
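As a sketch of how this looks in Istio 1.1, mesh-wide mTLS is enabled with a MeshPolicy:

```yaml
apiVersion: authentication.istio.io/v1alpha1
kind: MeshPolicy
metadata:
  name: default      # the mesh-wide policy must use this name
spec:
  peers:
  - mtls: {}         # require mutual TLS for all service-to-service traffic
```

Client sidecars also need DestinationRules with ISTIO_MUTUAL TLS mode so they present certificates; see the Istio authentication documentation for the full rollout procedure, including a permissive mode for incremental migration.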

Role Based Access Control (RBAC)

Authorization is a critical part of any secure system and is required for a successful Zero Trust Networking implementation. Istio provides flexible and highly performant RBAC via centralized policy management, so you can easily define which services are allowed to communicate and which endpoints services and users are allowed to reach. This makes implementing the principle of least privilege (PoLP) simple and reduces development teams’ burden of creating and maintaining authorization-specific code.
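In Istio 1.1, this is expressed with ServiceRole and ServiceRoleBinding resources. A minimal sketch, with hypothetical service and account names, granting a frontend read-only access to a products service:

```yaml
apiVersion: rbac.istio.io/v1alpha1
kind: ServiceRole
metadata:
  name: products-viewer
  namespace: default
spec:
  rules:
  - services: ["products.default.svc.cluster.local"]
    methods: ["GET"]       # read-only access
---
apiVersion: rbac.istio.io/v1alpha1
kind: ServiceRoleBinding
metadata:
  name: bind-products-viewer
  namespace: default
spec:
  subjects:
  - user: "cluster.local/ns/default/sa/frontend"   # the frontend's service identity
  roleRef:
    kind: ServiceRole
    name: products-viewer
```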

Network Policy

With Istio’s centralized policy management, you can enforce networking rules at runtime. Common examples include, but are not limited to, the following:

  • Whitelisting and blacklisting access to services, so that access is only granted to certain actors.
  • Rate limiting traffic, to ensure a bad actor does not cause a Denial of Service attack.
  • Redirecting requests, to enforce that certain actors go through proper channels when making their requests.

Cyber Attacks Mitigated by Zero Trust Networking With Istio

The following are example attacks that can be mitigated:

  1. Service Impersonation - A bad actor gains access to your applications' private network, pretends to be an authorized service, and starts making requests for sensitive data.
  2. Unauthorized Access - A legitimate service makes requests for sensitive data that it is not authorized to obtain.
  3. Packet Sniffing - A bad actor gains access to your applications' private network and captures sensitive data from legitimate requests going over the network.
  4. Data Exfiltration - A bad actor sends sensitive data out of the protected network to a destination of their choosing.

Applying Zero Trust Networking in Highly Regulated Industries

To combat increased high profile cyber attacks, regulations and standards are evolving to include stricter controls to enforce that organizations follow best practices when processing and storing sensitive data. 

The most common technical requirements across regulations and standards are:

  • Authentication - verify the identity of the actor seeking access to protected data.
  • Authorization - verify the actor is allowed to access the requested protected data.
  • Accounting - mechanisms for recording and examining activities within the system.
  • Data Integrity - protecting data from being altered or destroyed in an unauthorized manner.

As you may have noticed, applying Zero Trust Networking within your application infrastructure not only increases your security posture and helps mitigate cyber attacks, it also addresses control requirements set forth in regulations and standards such as HIPAA, PCI-DSS, GDPR, and FISMA.

Use Istio to Achieve Zero Trust the Easy Way

High-profile data breaches are at an all-time high, cost an average of $3.92 million, and take upwards of 314 days from breach to containment. Implementing Zero Trust Networking with Istio to secure your microservice architecture at scale is simple, requires little effort, and can be completed with minimal service disruption. If you would like to discuss the most effective way for your organization to achieve zero trust, grab some time to talk through your use case and how Aspen Mesh can help solve your security concerns.

Recommended Remediation for Kubernetes CVE-2019-11247

A Kubernetes vulnerability CVE-2019-11247 was announced this week.  While this is a vulnerability in Kubernetes itself, and not Istio, it may affect Aspen Mesh users. This blog will help you understand possible consequences and how to remediate.

  • You should mitigate this vulnerability by updating to Kubernetes 1.13.9, 1.14.5 or 1.15.2 as soon as possible.
  • This vulnerability affects the interaction of Roles (a Kubernetes RBAC resource) and CustomResources (Istio uses CustomResources for things like VirtualServices, Gateways, DestinationRules and more).
  • Only certain definitions of Roles are vulnerable.
  • Aspen Mesh's installation does not define any such Roles, but you might have defined them for other software in your cluster.
  • More explanation and recovery details below.


If you have a Role defined anywhere in your cluster with a "*" for resources or apiGroups, then anything that can use that Role could escalate to modify many CustomResources.
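For example, a Role like this hypothetical one is vulnerable, because on unpatched Kubernetes the wildcards also match cluster-scoped CustomResources:

```yaml
apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  name: overly-broad     # hypothetical; avoid Roles like this
  namespace: default
rules:
- apiGroups: ["*"]       # wildcard apiGroup is vulnerable
  resources: ["*"]       # wildcard resource is vulnerable
  verbs: ["get", "list", "create", "update", "delete"]
```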

This Kubernetes issue and this helpful blog have extensive details and an example walkthrough that we'll summarize here.  Kubernetes Roles define sets of permissions for resources in one particular namespace (like "default").  They are not supposed to define permissions for resources in other namespaces or resources that live globally (outside of any namespace); that's what ClusterRoles are for.

Here's an example Role from Aspen Mesh that says "The IngressGateway should be able to get, watch or list any secrets in the istio-system namespace", which it needs to bootstrap secret discovery to get keys for TLS or mTLS:

apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  name: istio-ingressgateway-sds
  namespace: istio-system
rules:
- apiGroups: [""]
  resources: ["secrets"]
  verbs: ["get", "watch", "list"]

This Role does not grant permissions to secrets in any other namespace, or grant permissions to anything aside from secrets.   You'd need additional Roles and RoleBindings to add those. It doesn't grant permissions to get or modify cluster-wide resources.  You'd need ClusterRoles and ClusterRoleBindings to add those.

The vulnerability is that if the Role grants access to CustomResources in one namespace, it accidentally grants access to the same kinds of CustomResources that exist at global scope.  This is not exploitable in many cases because if you have a namespace-scoped CustomResource called "Foo", you can't also have a global CustomResource called "Foo", so there are no global-scope Foos to attack.  Unfortunately, if your role allows access to a resource "*" or apiGroup "*", then the "*" matches both namespace-scoped Foo and globally-scoped Bar in vulnerable versions of Kubernetes. This Role could be used to attack Bar.

If you're really scrutinizing the above example, note that the apiGroup "" is different than the apiGroup "*": the empty "" refers to core Kubernetes resources like Secrets, while "*" is a wildcard meaning any.

Aspen Mesh and Istio define three globally-scoped CustomResources: ClusterRbacConfig, MeshPolicy, ClusterIssuer.  If you had a vulnerable Role defined and an attacker could assume that Role, then they could have modified those three CustomResources.  Aspen Mesh does not provide any vulnerable Roles so a cluster would need to have those Roles defined for some other purpose.


First, and as soon as possible, you should upgrade Kubernetes to a non-vulnerable version: 1.13.9, 1.14.5 or 1.15.2.

When you define Roles and ClusterRoles, you should follow the Principle of Least Privilege and avoid granting access to "*" - always prefer listing the specific resources required.

You can examine your cluster for vulnerable Roles.  If any exist, or have existed, those could have been exploited by an attacker with knowledge of this vulnerability to modify Aspen Mesh configuration.  To mitigate this, recreate any CustomResources after upgrading to a non-vulnerable version of Kubernetes.

This snippet will print out any vulnerable Roles currently configured, but cannot tell you if any may have existed in the past.  It relies on the jq tool:

# Roles granting access to any resource ("*")
kubectl get role --all-namespaces -o json | jq '.items[] | select(.rules[].resources | index("*"))'
# Roles granting access to any apiGroup ("*")
kubectl get role --all-namespaces -o json | jq '.items[] | select(.rules[].apiGroups | index("*"))'

This is not a vulnerability in Aspen Mesh, so there is no need to upgrade Aspen Mesh.