Securing Containerized Applications With Service Mesh

The self-contained, ephemeral nature of microservices comes with some serious upside, but keeping track of every single one is a challenge, especially when trying to figure out how the rest are affected when a single microservice goes down. The end result is that if you’re operating or developing a microservices architecture, there’s a good chance part of your day is spent wondering what your services are up to.

With the adoption of microservices, problems also emerge due to the sheer number of services that exist in large systems. Problems like security, load balancing, monitoring and rate limiting that had to be solved once for a monolith now have to be handled separately for each service.

The technology aimed at addressing these microservice challenges has been rapidly evolving:

  1. Containers facilitate the shift from monolith to microservices by enabling independence between applications and infrastructure.
  2. Container orchestration tools solve microservices build and deploy issues, but leave many unsolved runtime challenges.
  3. Service mesh addresses runtime issues including service discovery, load balancing, routing and observability.

Securing services with a service mesh

A service mesh provides an advanced toolbox that lets users add security, stability and resiliency to containerized applications. One of the more common applications of a service mesh is bolstering cluster security. There are three distinct capabilities provided by the mesh that enable platform owners to create a more secure architecture.

Traffic Encryption  

As a platform operator, I need to provide encryption between services in the mesh. I want to leverage mTLS to encrypt traffic between services. I want the mesh to automatically encrypt and decrypt requests and responses, so I can remove that burden from my application developers. I also want it to improve performance by prioritizing the reuse of existing connections, reducing the need for the computationally expensive creation of new ones. Finally, I want to be able to understand and enforce how services are communicating and prove it cryptographically.
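
As a rough illustration of what that intent looks like in practice (this is a hedged sketch, not Aspen Mesh's exact configuration), recent Istio releases let a platform operator express it with a single PeerAuthentication resource.  Applied in the mesh's root namespace, it requires mTLS for every sidecar without any application changes:

# Hedged sketch: mesh-wide strict mTLS, assuming Istio 1.5+ and a default
# install where istio-system is the root namespace.
apiVersion: security.istio.io/v1beta1
kind: PeerAuthentication
metadata:
  name: default
  namespace: istio-system
spec:
  mtls:
    mode: STRICT   # sidecars reject any plaintext service-to-service traffic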

Security at the Edge

As a platform operator, I want Aspen Mesh to add a layer of security at the perimeter of my clusters so I can monitor and address compromising traffic as it enters the mesh. I can use the built-in power of Kubernetes as an ingress controller to add security with ingress rules such as whitelisting and blacklisting. I can also apply service mesh route rules to manage compromising traffic at the edge. I also want control over egress so I can dictate that our network traffic does not go places it shouldn't (blacklist by default and only talk to what you whitelist).
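
To make the egress half of that concrete, here is a hedged sketch using Istio constructs (host and namespace names are hypothetical): if the mesh is installed with outboundTrafficPolicy set to REGISTRY_ONLY, sidecars block any destination that isn't in the service registry, and each allowed external service has to be whitelisted explicitly with a ServiceEntry.

# Hypothetical whitelist entry: workloads may reach api.example.com over TLS;
# with outboundTrafficPolicy REGISTRY_ONLY, unlisted external hosts stay blocked.
apiVersion: networking.istio.io/v1alpha3
kind: ServiceEntry
metadata:
  name: allow-example-api
  namespace: istio-public
spec:
  hosts:
  - api.example.com
  location: MESH_EXTERNAL
  ports:
  - number: 443
    name: tls
    protocol: TLS
  resolution: DNS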

Role Based Access Control (RBAC)

As the platform operator, it’s important that I am able to enforce least privilege, so the developers on my platform only have access to what they need, and nothing more. I want to enable controls so app developers can write policy for their apps and only their apps, so that they can move quickly without impacting other teams. I want to use the same RBAC framework that I am familiar with to provide fine-grained RBAC within my service mesh.
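
A minimal sketch of that least-privilege idea using the familiar Kubernetes RBAC framework (team, namespace and group names are hypothetical): a Role that lets one app team manage Istio traffic policy only inside its own namespace, bound to that team's group.

# Hypothetical Role/RoleBinding: the payments team can edit routing resources,
# but only in the payments namespace - nothing cluster-wide.
apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  name: payments-mesh-editor
  namespace: payments
rules:
- apiGroups: ["networking.istio.io"]
  resources: ["virtualservices", "destinationrules"]
  verbs: ["get", "list", "watch", "create", "update", "patch", "delete"]
---
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  name: payments-mesh-editor
  namespace: payments
subjects:
- kind: Group
  name: payments-developers
  apiGroup: rbac.authorization.k8s.io
roleRef:
  kind: Role
  name: payments-mesh-editor
  apiGroup: rbac.authorization.k8s.io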

How a service mesh adds security

You’re probably thinking to yourself, traffic encryption and fine-grained RBAC sound great, but how does a service mesh actually get me to them? Service meshes that leverage a sidecar approach are uniquely positioned to intercept and encrypt data. A sidecar proxy is a prime insertion point to ensure that every service in a cluster is secured and monitored in real time. Let’s explore some details around why sidecars are a great place for security.

Sidecar is a great place for security

Securing applications and infrastructure has always been daunting, in part because the adage really is true: you are only as secure as your weakest link.  Microservices are an opportunity to improve your security posture but can also cut the other way, presenting challenges around consistency.  For example, the best organizations use the principle of least privilege: an app should only have the minimum amount of permissions and privilege it needs to get its job done.  That's easier to apply where a small, single-purpose microservice has clear and narrowly-scoped API contracts.  But there's a risk that as application count increases (lots of smaller apps), this principle can be unevenly applied. Microservices, when managed properly, increase feature velocity and enable security teams to fulfill their charter without becoming the Department of No.

There's tension: Move fast, but don't let security coverage slip through the cracks.  Prefer many smaller things to one big monolith, but secure each and every one.  Let each team pick the language of their choice, but protect them with a consistent security policy.  Encourage app teams to debug, observe and maintain their own apps but encrypt all service-to-service communication.

A sidecar is a great way to balance these tensions with an architecturally sound security posture.  Sidecar-based service meshes like Istio and Linkerd 2.0 put their datapath functionality into a separate container and then situate that container as close to the application it is protecting as possible.  In Kubernetes, the sidecar container and the application container live in the same Kubernetes Pod, so the communication path between sidecar and app is protected inside the pod's network namespace; by default it isn't visible to the host or other network namespaces on the system.  The app, the sidecar and the operating system kernel are involved in communication over this path.  Compared to putting the security functionality in a library, using a sidecar adds the surface area of kernel loopback networking inside of a namespace, instead of just kernel memory management.  This is additional surface area, but not much.
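
To make the layout concrete, here is a stripped-down, hand-written Pod sketch (in a real mesh the sidecar is injected automatically, and the image names are only illustrative).  Because both containers share the pod's network namespace, the app-to-sidecar hop happens over localhost and never leaves the pod:

# Hypothetical pod: app and proxy share one network namespace.
apiVersion: v1
kind: Pod
metadata:
  name: shoppingcart
spec:
  containers:
  - name: app
    image: example/shoppingcart:1.0   # hypothetical application image
    ports:
    - containerPort: 8080
  - name: istio-proxy
    image: istio/proxyv2:1.5.0        # Envoy-based sidecar; terminates mTLS for the app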

The major drawbacks of library approaches are consistency and sprawl in polyglot environments.  If you have a few different languages or application frameworks and take the library approach, you have to secure each one.  This is not impossible, but it's a lot of work.  For each different language or framework, you get or choose a TLS implementation (perhaps choosing between OpenSSL and BoringSSL).  You need a configuration layer to load certificates and keys from somewhere and safely pass them down to the TLS implementation.  You need to reload these certs and rotate them.  You need to evaluate "information leakage" paths: does your config parser log errors in plaintext (so it by default might print the TLS key to the logs)?  Is it OK for app core dumps to contain these keys?  How often does your organization require re-keying on a connection?  By bytes or time or both?  Minimum cipher strength?  When a CVE in OpenSSL comes out, what apps are using that version and need updating?  Who on each app team is responsible for updating OpenSSL, and how quickly can they do it?  How many apps have a certificate chain built into them for consuming public websites even if they are internal-only?  How many Dockerfiles will you need to update the next time a public signing authority has to revoke one?  slowloris?

Your organization can do all this work.  In fact, parts probably already have - above is our list of painful app security experiences but you probably have your own additions.  It is a lot of cross-organizational effort and process to get it right.  And you have to get it right everywhere, or your weakest link will be exploited.  Now with microservices, you have even more places to get it right.  Instead, our advice is to focus on getting it right once in the sidecar, distribute that sidecar everywhere, and get back to adding business value instead of duplicating effort.

There are some interesting developments on the horizon like the use of kernel TLS to defer bulk and some asymmetric crypto operations to the kernel.  That's great:  Implementations should change and evolve.  The first step is providing a good abstraction so that apps can delegate to lower layers. Once that's solid, it's straightforward to move functionality from one layer to the next as needed by use case, because you don't perturb the app any more.  As precedent, consider TCP Segmentation Offload, which lets the network card manage splitting app data into the correct size for each individual packet.  This task isn't impossible for an app to do, but it turns out to be wasted effort.  By deferring TCP segmentation to the kernel, it left the realm of the app.  Then, kernels, network drivers, and network cards were free to focus on the interoperability and semantics required to perform TCP segmentation at the right place.  That's our position for this higher-level service-to-service communication security: move it outside of the app to the sidecar, and then let sidecars, platforms, kernels and networking hardware iterate.

Envoy Is a Great Sidecar

We use Envoy as our sidecar because it's lightweight, has some great features and good API-based configurability.  Here are some of our favorite parts about Envoy:

  • Configurable TLS Parameters: Envoy exposes all the TLS configuration points you'd expect (cipher strength, protocol versions, curves).  The advantage to using Envoy is that they're configured the same way for every app using the sidecar.
  • Mutual TLS: Typically TLS is used to authenticate the server to the client, and to encrypt communication.  What's missing is authenticating the client to the server - if you do this, then the server knows what is talking to it.  Envoy supports this bi-directional authentication out of the box, which can easily be incorporated into a SPIFFE system.  In today's complex cloud datacenter, you're better off if you trust things based on cryptographic proof of what they are, instead of network perimeter protection of where they called from.
  • BoringSSL: This fork of OpenSSL removed huge amounts of code like implementations of obsolete ciphers and cleaned up lots of vestigial implementation details that had repeatedly been the source of security vulnerabilities.  It's a good default choice if you don't need any OpenSSL-specific functionality because it's easier to get right.
  • Security Audit: A security audit can't prove the absence of vulnerabilities but it can catch mistakes that demonstrate either architectural weaknesses or implementation sloppiness.  Envoy's security audit did find issues but in our opinion indicated a high level of security health.
  • Fuzzed and Bountied: Envoy is continuously fuzzed (exposed to malformed input to see if it crashes) and covered by Google's Patch Reward security bug bounty program.
  • Good API Granularity: API-based configuration doesn't mean "just serialize/deserialize your internal state and go."  Careful APIs thoughtfully map to the "personas" of what's operating them (even if those personas are other programs).  Envoy's xDS APIs in our experience partition routing behavior from cluster membership from secrets.  This makes it easy to make well-partitioned controllers.  A knock-on benefit is that it is easy in our experience to debug and test Envoy because config constructs usually map pretty clearly to code constructs.
  • No garbage collector: There are great languages with automatic memory management like Go that we use every day.  But we find languages like C++ and Rust provide predictable and optimizable tail latency.
  • Native Extensibility via Filters: Envoy has layer 4 and layer 7 extension points via filters that are written in C++ and linked into Envoy.
  • Scripting Extensibility via Lua: You can write Lua scripts as extension points as well.  This is very convenient for rapid prototyping and debugging.

One of these benefits deserves an even deeper dive in a security-oriented discussion.  The API granularity of Envoy is based on a scheme called "xDS" which we think of as follows:  Logically split the Envoy config API based on the user of that API.  The user in this case is almost always some other program (not a human), for instance a Service Mesh control plane element.

For instance, in xDS listeners ("How should I get requests from users?") are separated from clusters ("What pods or servers are available to handle requests to the shoppingcart service?").  The "x" in "xDS" is replaced with whatever functionality is implemented ("LDS" for listener discovery service).  Our favorite security-related partitioning is that the Secret Discovery Service can be used for propagating secrets to the sidecars independent of the other xDS APIs.

Because SDS is separate, the control plane can implement the Principle of Least Privilege: nothing outside of SDS needs to handle or have access to any private key material.
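
As a hedged sketch of that separation (field names from Envoy's v3 xDS API, trimmed heavily and not a complete listener), this is roughly how a listener's TLS transport socket references a certificate delivered over SDS: the listener names the secret and pins the TLS parameters, while the private key itself only ever travels over the dedicated SDS channel.

# Hypothetical fragment of an Envoy listener filter chain.
transport_socket:
  name: envoy.transport_sockets.tls
  typed_config:
    "@type": type.googleapis.com/envoy.extensions.transport_sockets.tls.v3.DownstreamTlsContext
    require_client_certificate: true      # mutual TLS: authenticate the client too
    common_tls_context:
      tls_params:
        tls_minimum_protocol_version: TLSv1_2
      tls_certificate_sds_secret_configs:
      - name: default                     # secret name served by the SDS endpoint
        sds_config:
          api_config_source:
            api_type: GRPC
            transport_api_version: V3
            grpc_services:
            - envoy_grpc:
                cluster_name: sds-grpc    # hypothetical cluster pointing at the local agent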

Mutual TLS is a great enhancement to your security posture in a microservices environment.  We see mutual TLS adoption as gradual - almost any real-world app will have some containerized microservices ready to join the service mesh and use mTLS on day one.  But practically speaking, many of these will depend on mesh-external services, containerized or not.  It is possible in most cases to integrate these services into the same trust domain as the service mesh, and oftentimes these components can even participate in client TLS authentication so you get true mutual TLS.

In our experience, this happens by gradually expanding the "circle" of things protected with mutual TLS.  First, stateless containerized business logic, next in-cluster third party services, finally external state stores like bare metal databases.  That's why we focus on making the state of mTLS easy to understand in Aspen Mesh, and provide assistants to help you detect configuration mishaps.

What lives outside the sidecar?

You need a control plane to configure all of these sidecars.  In some simple cases it may be tempting to do this with some CI integration to generate configs plus DNS-based discovery.  This is viable but it's hard to do rapid certificate rotation.  Also, it leaves out more dynamic techniques like canaries, progressive delivery and A/B testing.  For this reason, we think most real-world applications will include an online control plane that should:

  • Disseminate configuration to each of the sidecars with a scalable approach.
  • Rotate sidecar certificates rapidly to reduce the value to an attacker of a one-time exploit of an application.
  • Collect metadata on what is communicating with what.

A good security posture means you should be automating some work on top of the control plane. We think these things are important (and built them into Aspen Mesh):

  • Organizing information to help humans narrow in on problems quickly.
  • Warning on potential misconfigurations.
  • Alerting when unhealthy communication is observed.
  • Inspecting the firehose of metadata for surprises - these patterns could be application bugs, security issues or both.

If you’re considering or going down the Kubernetes path, you should be thinking about the unique security challenges that come with microservices running in a Kubernetes cluster. Kubernetes solves many of these, but there are some critical runtime issues that a service mesh can make easier and more secure. If you would like to talk about how the Aspen Mesh platform and team can address your specific security challenge, feel free to find some time to chat with us.


Using Kubernetes RBAC to Control Global Configuration In Istio

Why Configuration Matters

If I'm going to get an error for my code, I like to get an error as soon as possible.  Unit test failures are better than integration test failures. I prefer compiler errors to unit test failures - that's what makes TypeScript great.  Going even further, a syntax highlighter is a very proximate feedback of errors - if that keyword doesn't turn green, I can fix the spelling with almost no conscious burden.

Shifting from coding to configuration, I like narrow configuration systems that make it easy to specify a valid configuration and tell me as quickly as possible when I've done something wrong.  In fact, my dream configuration specification wouldn't allow me to specify something invalid. Remember Newspeak from 1984?  A language so narrow that expressing ungood thoughts becomes impossible.  Apparently, I like my programming and configuration languages to be dystopic and Orwellian.

If you can't further narrow the language of your configuration without giving away core functionality, the next best option is to narrow the scope of configuration.  I think about this in reverse - if I am observing some behavior from the system and say to myself, "Huh, that's weird", how much configuration do I have to look at before I understand?  It's great if this is a small and expanding ring - look at config that's very local to this particular object, then the next layer up (in Kubernetes, maybe the rest of the namespace), and so on to what is hopefully a very small set of global config.  Ever tried to debug a program with global variables all over the place? Not fun.

My three principles of ideal config:

  1. Narrow: Like Newspeak, don't allow me to even think of invalid configuration.
  2. Scope: Only let me affect a small set of things associated with my role.
  3. Time: Tell me as early as possible if it's broken.

Declarative config is readily narrow.  The core philosophy of declarative config is saying what you want, not how you get it.  For example, "we'll meet at the Denver Zoo at noon" is declarative. If instead I specify driving directions to the Denver Zoo, I'm taking a much more imperative approach.  What if you want to bike there? What if there is road construction and a detour is required? The value of declarative config is that if we focus on what we want, instead of how to get it, it's easier for me to bring my controller (my car's GPS) and you to bring yours (Google Maps in the Bike setting).

On the other hand, a big part of configuration is pulling together a bunch of disparate pieces of information at the last moment, from a bunch of different roles (think humans, other configuration systems and controllers) just before the system actually starts running.  Some amount of flexibility is required here.

Does Cloud Native Get Us To Better Configuration?

I think a key reason for the popularity of Kubernetes is that it has a great syntax for specifying what a healthy, running microservice looks like.  Its syntax is powerful enough in all the right places to be practical for infrastructure.

Service meshes like Istio robustly connect all the microservices running in your cluster.  They can adaptively route L7 traffic, provide end-to-end mTLS based encryption, and provide circuit breaking and fault injection.  The long feature list is great, but it's not surprising that the result is a somewhat complex set of configuration resources. It's a natural result of the need for powerful syntax to support disparate use cases coupled with rapid development.

Enabling Fine-grained RBAC with Traffic Claim Enforcer

At Aspen Mesh, we found users (including ourselves) spending too much time understanding misconfiguration.  The first way we addressed that problem was with Istio Vet, which is designed to warn you of probably incorrect or incomplete config, and provide guidance to fix it.  Sometimes we know enough that we can prevent the misconfiguration by refusing to allow it in the first place.  For some Istio config resources, we do that using a solution we call Traffic Claim Enforcer.

There are four Istio configuration resources that have global implications: VirtualService, Gateway, ServiceEntry and DestinationRule.  Whenever you create one of these resources, you create it in a particular namespace. They can affect how traffic flows through the service mesh to any target they specify, even if that target isn't in the current namespace.  This surfaces a scope anti-pattern - if I'm observing weird behavior for some service, I have to examine potentially all DestinationRules in the entire Kubernetes cluster to understand why.
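
For example (namespaces and hosts here are hypothetical), nothing inherently stops a DestinationRule created in one team's namespace from changing traffic policy for a service owned by a completely different team:

# Hypothetical resource: lives in team-a, but silently disables mTLS for
# callers of a service that belongs to team-b.
apiVersion: networking.istio.io/v1alpha3
kind: DestinationRule
metadata:
  name: meddle-with-team-b
  namespace: team-a
spec:
  host: reviews.team-b.svc.cluster.local
  trafficPolicy:
    tls:
      mode: DISABLE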

That might work in the lab, but we found it to be a serious problem for applications running in production.  Not only is it hard to understand the current config state of the system, it's also easy to break. It’s important to have guardrails that make it so the worst thing I can mess up when deploying my tiny microservice is my tiny microservice.  I don't want the power to mess up anything else, thank you very much. I don't want sudo. My platform lead really doesn't want me to have sudo.

Traffic Claim Enforcer is an admission webhook that waits for a user to configure one of those resources with global implications and, before allowing it, checks:

  1. Does the resource have a narrow scope that affects only local things?
  2. Is there a TrafficClaim that grants the resource the broader scope requested?

A TrafficClaim is a new Kubernetes custom resource we defined that exists solely to narrow and define the scope of resources in a namespace.  Here are some examples:

kind: TrafficClaim
apiVersion: networking.aspenmesh.io/v1alpha3
metadata:
  name: allow-public
  namespace: cluster-public
claims:
# Anything on www.example.com
- hosts: [ "www.example.com" ]

# Only specific paths on foo.com, bar.com
- hosts: [ "foo.com", "bar.com" ]
  ports: [ 80, 443, 8080 ]
  http:
    paths:
      exact: [ "/admin/login" ]
      prefix: [ "/products" ]

# An external service controlled by ServiceEntries
- hosts: [ "my.external.com" ]
  ports: [ 80, 443, 8080, 8443 ]

TrafficClaims are controlled by Kubernetes Role-Based Access Control (RBAC).  Generally, the same roles or people that create namespaces and set up projects would also create TrafficClaims for those namespaces that need power to define service mesh traffic policy outside of their namespace scope.  Rule 1 about local scope above can be explained as "every namespace has an implied TrafficClaim for namespace-local traffic policy", to avoid requiring a boilerplate TrafficClaim.
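
A hedged sketch of what that looks like with standard Kubernetes RBAC (the role name is hypothetical; the API group matches the TrafficClaim example above): only subjects bound to this Role can create TrafficClaims in cluster-public, while app teams elsewhere keep their implied namespace-local claim.

# Hypothetical Role: grants full control over TrafficClaims in cluster-public.
apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  name: trafficclaim-admin
  namespace: cluster-public
rules:
- apiGroups: ["networking.aspenmesh.io"]
  resources: ["trafficclaims"]
  verbs: ["get", "list", "watch", "create", "update", "patch", "delete"]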

A pattern we use is to put global config into a namespace like "istio-public" - that's the only place that needs TrafficClaims for things like public DNS names.  Or you might have a couple of namespaces like "istio-public-prod" and "istio-public-dev" or similar. It’s up to you.

Traffic Claim Enforcer does not prevent you from thinking of invalid config, but it does help to limit scope. If I'm trying to understand what happens when traffic goes to my microservice, I no longer have to examine every DestinationRule in the system.  I only have to examine the ones in my namespace, and maybe some others that have special TrafficClaims (and hopefully keep that list small).

Traffic Claim Enforcer also provides an early failure for config problems.  Without it, it is easy to create conflicting DestinationRules even in separate namespaces. This is a problem that Istio-Vet will tell you about, but cannot fix - it doesn't know which one should have priority. If you define TrafficClaims, then Traffic Claim Enforcer can prevent you from configuring it at all.

Hat tip to my colleague, Brian Marshall, who developed the initial public spec for TrafficClaims.  The Istio community is undertaking a great deal of work to scope/manage config aimed at improving system scalability.  We made Traffic Claim Enforcer with a focus on scoping to improve the human config experience as it was a need expressed by several of our users.  We're optimistic that the way Traffic Claim Enforcer helps with human concerns will complement the system scalability side of things.

If you want to give Traffic Claim Enforcer a spin, it's included as part of Aspen Mesh.  By default it doesn't enforce anything, so out-of-the-box it is compatible with Istio. You can turn it on globally or on a namespace-by-namespace basis.

Click play below to check it out in action!

https://youtu.be/47HzynDsD8w

 



A Service Mesh Helps Simplify PCI DSS Compliance

PCI DSS is an information security standard for organizations that handle credit card data. The requirements are largely around developing and maintaining secure systems and applications and providing appropriate levels of access — things a service mesh can make easier.

However, building a secure, reliable and PCI DSS-compliant microservice architecture at scale is a difficult undertaking, even when using a service mesh.

For example, the standard comprises 12 separate requirements, each of which has its own sub-requirements. Additionally, some of the requirements are vague and are left up to the designated Qualified Security Assessor (QSA) to make their best judgment based on the design in question.

Meeting these requirements involves:

  • Controlling what services can talk to each other;
  • Guaranteeing non-repudiation for actors making requests;
  • Building accurate real-time and historical diagrams of cardholder data flows across systems and networks when services can be added, removed or updated at a team’s discretion.

Achieving PCI DSS compliance at scale can be simplified by implementing a uniform layer of infrastructure between services and the network that provides your operations team with centralized policy management and decouples them from feature development and release processes, regardless of scale and release velocity. This layer of infrastructure is commonly referred to as a service mesh. A service mesh provides many features that simplify compliance management, such as fine-grained control over service communication, traffic encryption, non-repudiation via service to service authentication with strong identity assertions, and rich telemetry and reporting.

Below are some of the key PCI DSS requirements listed in the 3.2.1 version of the requirements document, where a service mesh helps simplify the implementation of both controls and reporting:

  • Requirement 1: Install and maintain a firewall configuration to protect cardholder data
    • The first requirement focuses on firewall and router configurations that ensure cardholder data is only accessed when it should be and only by authorized sources.
  • Requirement 6: Develop and maintain secure systems and applications
    • The applicable portions of this requirement focus on encrypting network traffic using strong cryptography and restricting user access to URLs and functions.
  • Requirement 7: Restrict access to cardholder data by business need to know.
    • Arguably one of the most critical requirements in PCI DSS, since even the most secure system can be easily circumvented by overprivileged employees. This requirement focuses on restricting privileged users to the least privileges necessary to perform job responsibilities, ensuring access to systems is set to “deny all” by default, and ensuring proper documentation detailing roles and responsibilities is in place.
  • Requirement 8: Identify and authenticate access to system components
    • Building on the foundation of requirement 7, this requirement focuses on ensuring all users have a unique ID; controlling the creation, deletion and modification of identifier objects; revoking access; utilizing strong cryptography during credential transmission; verifying user identity before modifying authentication credentials.
  • Requirement 10: Track and monitor all access to network resources and cardholder data
    • This requirement puts a heavy emphasis on designing and implementing a secure, reliable and accurate audit trail of all events within the environment. This includes capturing all individual user access to cardholder data, invalid logical access attempts and intersystem communication logs. All audit trail entries should include: user identification, type of event, date and time, success or failure indication, origin of event and the identity of the affected data or resource.

Let’s review how Aspen Mesh, the enterprise-ready service mesh built on Istio, helps simplify both the implementation and reporting of controls for the above requirements.

‘Auditable’ Real-time and Historical Dataflows

Keeping records of how data flows through your system is one of the key requirements of PCI DSS compliance (1.1.3), as well as a security best practice. With Aspen Mesh, you can see a live view of your service-to-service communications and retrieve historical configurations detailing what services were communicating, their corresponding security configuration (e.g. whether or not mutual TLS was enabled, certificate thumbprint and validity period, internal IP address and protocol) and what the entire cluster configuration was at that point in time (minus secrets of course).

Encrypting Traffic and Achieving Non-Repudiation for Requests

Defense in depth is an industry best practice when securing sensitive data. Aspen Mesh can automatically encrypt and decrypt service requests and responses without development teams having to write custom logic per service. This helps reduce the amount of non-functional feature development work teams are required to do prior to delivering their services in a secure and compliant environment. The mesh also provides non-repudiation for requests by authenticating clients via mutual TLS and through a key management system that automates key and certificate generation, distribution and rotation, so you can be sure requests are actually coming from who they say they’re coming from. Encrypting traffic and providing non-repudiation is key to implementing controls that ensure sensitive data, such as a Primary Account Number (PAN), are protected from unauthorized actors sniffing traffic, and to providing definitive proof for auditing events as required by PCI DSS requirements 10.3.1, 10.3.5, and 10.3.6.

Strong Access Control and Centralized Policy Management

Aspen Mesh provides flexible and highly performant Role-Based Access Control (RBAC) via centralized policy management. RBAC allows you to easily implement and report on controls for requirements 7 and 8 – Implement Strong Access Control Measures. With policy control, you can easily define what services are allowed to communicate, what methods services can call, rate limit requests, and define and enforce quotas.
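
As one hedged illustration of this kind of control in Istio terms (service accounts, labels and paths are hypothetical, and recent Istio releases express this with the AuthorizationPolicy resource), a policy like the following allows only the checkout workload's identity to call specific methods and paths on the payments service, and denies everything else directed at that workload:

# Hypothetical policy: once an ALLOW policy selects the payments workload,
# any request that matches no rule is denied.
apiVersion: security.istio.io/v1beta1
kind: AuthorizationPolicy
metadata:
  name: payments-allow-checkout
  namespace: payments
spec:
  selector:
    matchLabels:
      app: payments
  action: ALLOW
  rules:
  - from:
    - source:
        principals: ["cluster.local/ns/shop/sa/checkout"]
    to:
    - operation:
        methods: ["GET", "POST"]
        paths: ["/api/v1/charges*"]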

Centralized Tracing, Monitoring and Alerting

One of the most difficult non-functional features to implement at scale is consistent and reliable application tracing. With Aspen Mesh, you get reliable and consistent in-depth tracing between all services within the mesh, configurable real-time dashboards, the ability to create criteria-driven alerts and the ability to retain your logs for at least one year — which exceeds requirements that dictate a minimum of three months of data be immediately available for analysis for PCI DSS requirements 10.1, 10.2-10.2.4, 10.3.1-10.3.6 and 10.7.

Aspen Mesh Makes it Easier to Implement and Scale a Secure and PCI DSS Compliant Microservice Environment

Managing a microservice architecture at scale is a serious challenge without the right tools. Ensuring that each service follows the proper organizational secure communication, authentication, authorization and monitoring policies required to comply with PCI DSS is not easy.

Achieving PCI DSS compliance involves addressing a number of different things around firewall configuration, developing applications securely and fine-grained RBAC. These are all distinct development efforts that can be hard to achieve individually, but even harder to achieve as a coordinated team. The good news is, with the help of Aspen Mesh, your engineering team can spend less time building and maintaining non-functional yet essential features, and more time building features that provide direct value to your customers.

 

Originally posted on The New Stack