Using Kubernetes RBAC to Control Global Configuration In Istio

Why Configuration Matters

If I'm going to get an error for my code, I like to get an error as soon as possible.  Unit test failures are better than integration test failures. I prefer compiler errors to unit test failures - that's what makes TypeScript great.  Going even further, a syntax highlighter is a very proximate feedback of errors - if that keyword doesn't turn green, I can fix the spelling with almost no conscious burden.

Shifting from coding to configuration, I like narrow configuration systems that make it easy to specify a valid configuration and tell me as quickly as possible when I've done something wrong.  In fact, my dream configuration specification wouldn't allow me to specify something invalid. Remember Newspeak from 1984?  A language so narrow that expressing ungood thoughts become impossible.  Apparently, I like my programming and configuration languages to be dystopic and Orwellian.

If you can't further narrow the language of your configuration without giving away core functionality, the next best option is to narrow the scope of configuration.  I think about this in reverse - if I am observing some behavior from the system and say to myself, "Huh, that's weird", how much configuration do I have to look at before I understand?  It's great if this is a small and expanding ring - look at config that's very local to this particular object, then the next layer up (in Kubernetes, maybe the rest of the namespace), and so on to what is hopefully a very small set of global config.  Ever tried to debug a program with global variables all over the place? Not fun.

My three principles of ideal config:

  1. Narrow Like Newspeak, don't allow me to even think of invalid configuration.
  2. Scope Only let me affect a small set of things associated with my role.
  3. Time Tell me as early as possible if it's broken.

Declarative config is readily narrow.  The core philosophy of declarative config is saying what you want, not how you get it.  For example, "we'll meet at the Denver Zoo at noon" is declarative. If instead I specify driving directions to the Denver Zoo, I'm taking a much more imperative approach.  What if you want to bike there? What if there is road construction and a detour is required? The value of declarative config is that if we focus on what we want, instead of how to get it, it's easier for me to bring my controller (my car's GPS) and you to bring yours (Google Maps in the Bike setting).

On the other hand, a big part of configuration is pulling together a bunch of disparate pieces of information together at the last moment, from a bunch of different roles (think humans, other configuration systems and controllers) just before the system actually starts running.  Some amount of flexibility is required here.

Does Cloud Native Get Us To Better Configuration?

I think a key reason for the popularity of Kubernetes is that it has a great syntax for specifying what a healthy, running microservice looks like.  Its syntax is powerful enough in all the right places to be practical for infrastructure.

Service meshes like Istio robustly connect all the microservices running in your cluster.  They can adaptively route L7 traffic, provide end-to-end mTLS based encryption, and provide circuit breaking and fault injection.  The long feature list is great, but it's not surprising that the result is a somewhat complex set of configuration resources. It's a natural result of the need for powerful syntax to support disparate use cases coupled with rapid development.

Enabling Fine-grained RBAC with Traffic Claim Enforcer

At Aspen Mesh, we found users (including ourselves) spending too much time understanding misconfiguration.  The first way we addressed that problem was with Istio Vet, which is designed to warn you of probably incorrect or incomplete config, and provide guidance to fix it.  Sometimes we know enough that we can prevent the misconfiguration by refusing to allow it in the first place.  For some Istio config resources, we do that using a solution we call Traffic Claim Enforcer.

There are four Istio configuration resources that have global implications: VirtualService, Gateway, ServiceEntry and DestinationRule.  Whenever you create one of these resources, you create it in a particular namespace. They can affect how traffic flows through the service mesh to any target they specify, even if that target isn't in the current namespace.  This surfaces a scope anti-pattern - if I'm observing weird behavior for some service, I have to examine potentially all DestinationRules in the entire Kubernetes cluster to understand why.

That might work in the lab, but we found it to be a serious problem for applications running in production.  Not only is it hard to understand the current config state of the system, it's also easy to break. It’s important to have guardrails that make it so the worst thing I can mess up when deploying my tiny microservice is my tiny microservice.  I don't want the power to mess up anything else, thank you very much. I don't want sudo. My platform lead really doesn't want me to have sudo.

Traffic Claim Enforcer is an admission webhook that waits for a user to configure one of those resources with global implications, and before allowing will check:

  1. Does the resource have a narrow scope that affects only local things?
  2. Is there a TrafficClaim that grants the resource the broader scope requested?

A TrafficClaim is a new Kubernetes custom resource we defined that exists solely to narrow and define the scope of resources in a namespace.  Here are some examples:

kind: TrafficClaim
apiVersion: networking.aspenmesh.io/v1alpha3
metadata:
  name: allow-public
  namespace: cluster-public
claims:
# Anything on www.example.com
- hosts: [ "www.example.com" ]

# Only specific paths on foo.com, bar.com
- hosts: [ "foo.com", "bar.com" ]
  ports: [ 80, 443, 8080 ]
  http:
    paths:
      exact: [ "/admin/login" ]
      prefix: [ "/products" ]

# An external service controlled by ServiceEntries
- hosts: [ "my.external.com" ]
  ports: [ 80, 443, 8080, 8443 ]

TrafficClaims are controlled by Kubernetes Role-Based Access Control (RBAC).  Generally, the same roles or people that create namespaces and set up projects would also create TrafficClaims for those namespaces that need power to define service mesh traffic policy outside of their namespace scope.  Rule 1 about local scope above can be explained as "every namespace has an implied TrafficClaim for namespace-local traffic policy", to avoid requiring a boilerplate TrafficClaim.

A pattern we use is to put global config into a namespace like "istio-public" - that's the only place that needs TrafficClaims for things like public DNS names.  Or you might have a couple of namespaces like "istio-public-prod" and "istio-public-dev" or similar. It’s up to you.

Traffic Claim Enforcer does not prevent you from thinking of invalid config, but it does help to limit scope. If I'm trying to understand what happens when traffic goes to my microservice, I no longer have to examine every DestinationRule in the system.  I only have to examine the ones in my namespace, and maybe some others that have special TrafficClaims (and hopefully keep that list small).

Traffic Claim Enforcer also provides an early failure for config problems.  Without it, it is easy to create conflicting DestinationRules even in separate namespaces. This is a problem that Istio-Vet will tell you about, but cannot fix - it doesn't know which one should have priority. If you define TrafficClaims, then Traffic Claim Enforcer can prevent you from configuring it at all.

Hat tip to my colleague, Brian Marshall, who developed the initial public spec for TrafficClaims.  The Istio community is undertaking a great deal of work to scope/manage config aimed at improving system scalability.  We made Traffic Claim Enforcer with a focus on scoping to improve the human config experience as it was a need expressed by several of our users.  We're optimistic that the way Traffic Claim Enforcer helps with human concerns will complement the system scalability side of things.

If you want to give Traffic Claim Enforcer a spin, it's included as part of Aspen Mesh.  By default it doesn't enforce anything, so out-of-the-box it is compatible with Istio. You can turn it on globally or on a namespace-by-namespace basis.

Click play below to check it out in action!

https://youtu.be/47HzynDsD8w

 


Leveraging Service Mesh To Address HIPAA Security Requirements

Building a product utilizing a distributed microservice architecture for the healthcare industry, while following the requirements set forth in the Health Insurance Portability and Accountability Act (HIPAA), is hard. Trust me, I have felt the pain. I’ve spent the majority of my career building, securing and ensuring compliance for products in highly regulated industries, including healthcare. The healthcare industry is a necessity for all, which is causing it to grow at a rapid pace as new advancements are made. This is great for our health and wellbeing, but it starts to pose new challenges for organizations that process and store sensitive data such as Personally Identifiable Information (PII) and Electronic Protected Health Information (ePHI). What used to be to be a system of paper charts in manila envelopes stored in filing cabinets, is now a large interconnected system where patient medications, x-rays, surgeries, diagnosis and other health related data are transferred between internal and external entities. This advancement has allowed physicians to quickly provide other entities with your entire medical history, so you receive the best care possible, as quickly as possible. But this exchange does not come without risk. Anytime you make something more accessible, you also introduce new attack surfaces and points of failure, allowing data to be leaked and increasing the possibility of malicious attacks.

The HIPAA Security Rule was created to help address this new risk. It mandates that organizations that process or store ePHI follow certain safeguards to protect sensitive data.

The technical safeguard standards introduced by the Security Rule include:

  • Authentication - verification of the identity of the actor seeking access to protected data.
  • Authorization - verification that the actor is allowed to access the requested protected data.
  • Audit Controls - mechanisms for recording and examining activities pertaining to protected data within the system.
  • Data Integrity - protecting the data from being altered or destroyed in an unauthorized manner.

Implementing these safeguards may seem like an obvious thing to do when processing or storing sensitive data, but all too often they are overlooked or may be deemed too difficult, expensive and/or time consuming to implement with available resources. No matter the reason, this is a violation in the eyes of the U.S Department of Health and Human Services (HHS) Office for Civil Rights (OCR) and can result in fines up to $1.5 million a year for each violation and can even result in criminal charges. Fortunately, a service mesh helps address many of these standards in a way that requires less effort than building custom controls, and is also less error prone.

Let’s take a look at how you can leverage Aspen Mesh, the enterprise-ready service mesh built on Istio, to easily implement controls to address many of these standards that would otherwise require significant development effort and expertise.

Authentication
As briefly discussed, authentication is the verification of the identity of the actor seeking access to protected data. With Aspen Mesh, you can easily configure mesh wide service-to-service authentication and end-user authentication with little effort. In fact, if you use the recommended default Aspen Mesh installation, it will enable mesh wide mTLS automatically without requiring any code changes.

Now that you have service-to-service authentication and transport encryption enabled, the next step is to enable end-user authentication.

Below is an example of how you would enable end-user authentication on a Patient Check-in Service using an external Identity Management Service that supports JWTs (e.g. Azure Active Directory B2C, Amazon Cognito, Auth0, Okta, GSuite), so reception personnel can securely login and check-in patients as they arrive.

1. You’re going to need to make note of the JWT Issuer and JWK URI from your User Directory Service.
2. Create and apply a Policy called patients-checkin-user-auth that configures end user authentication to the Patient Check-in Service using your JWT supported Identity Management Service of choice.

apiVersion: "authentication.istio.io/v1alpha1"
kind: "Policy"
metadata:
  name: "patients-checkin-user-auth"
spec:
  targets:
  - name: patient-checkin
  peers:
  - mtls:
  origins:
  - jwt:
      issuer: "<REPLACE_WITH_YOUR_JWT_SUPPORTED_IDENTITY_MANAGEMENT_SERVICE_ISSUER>"
      jwksUri: "<REPLACE_WITH_YOUR_JWT_SUPPORTED_IDENTITY_MANAGEMENT_SERVICE_USER_DIRECTORY_JWKS_URI>"
  principalBinding: USE_ORIGIN

3. Ensure that the Patient Check-in frontend application places the JWT token in the Authorization header in http requests to the backend services
4. That’s it!

Authorization
Aspen Mesh provides flexible and fine-grained Role-Based Access Control (RBAC) via centralized policy management. With policy control, you can easily define what services are allowed to communicate, what methods services can call, rate limit requests and define and enforce quotas.

Below is a simple example of how a Patient Check-in Service can make GET, PUT, and POST requests to the Patients Service, but can’t make DELETE requests. While the Admin Service can make GET, POST, PUT, and DELETE requests to the Patients Service.  

1. Create a ServiceRole called patient-service-querie which allows making GET, PUT, POST requests to the Patients Service.

apiVersion: "rbac.istio.io/v1alpha1"
kind: ServiceRole
metadata:
  name: patient-service-querier
  namespace: default
spec:
  rules:
  - services: ["patients.default.svc.cluster.local"]
    methods: ["GET", “PUT”, “POST”]

2. Create another ServiceRole called patients-admin that allows GET, POST, PUT, and DELETE requests to the Patients Service.

apiVersion: "rbac.istio.io/v1alpha1"
kind: ServiceRole
metadata:
  name: patients-admin
  namespace: default
spec:
  rules:
  - services: ["patients.default.svc.cluster.local"]
    methods: ["GET", "POST", "PUT", DELETE]

3. Create a ServiceRoleBinding called bind-patient-service-querier which assigns patient-querier role to the cluster.local/ns/default/sa/patient-check-in service account, which represents the Patient Check-In Service.

apiVersion: "rbac.istio.io/v1alpha1"
kind: ServiceRoleBinding
metadata:
  name: bind-patient-service-querier
  namespace: default
spec:
  subjects:
  - user: "cluster.local/ns/default/sa/patient-check-in"
  roleRef:
    kind: ServiceRole
    name: "patient-querier"

4. Lastly we’ll create another ServiceRoleBinding called bind-patient-service-admin which assigns patient-admin role to the cluster.local/ns/default/sa/admin service account, which represents the Admin Service.

apiVersion: "rbac.istio.io/v1alpha1"
kind: ServiceRoleBinding
metadata:
  name: bind-patient-service-admin
  namespace: default
spec:
  subjects:
  - user: "cluster.local/ns/default/sa/admin"
  roleRef:
    kind: ServiceRole
    name: "patient-admin"

As you can see, you can quickly and effectively add Authorization between services in your mesh without any custom development work.

Audit Controls
Keeping audit records of data access is one of the key requirements for HIPAA compliance, as well as a security best practice. With Aspen Mesh, you get a single source of truth with in-depth tracing between all services within the mesh. Traces can be accessed and exported via the ‘Tracing’ tab on the Aspen Mesh Dashboard or API. You may still need to add the corresponding audit logs for specific actions that happen within a service to comply with all of the requirements, but at least you have reduced the amount of engineering effort spent on non-functional but essential tasks.

Data Integrity
Data integrity and confidentiality is arguably one of the most critical requirements of HIPAA. If sensitive data such as medications, blood type or allergies are modified by or leaked to an unauthorized user, it could be detrimental to the patient. With Aspen Mesh you can quickly and easily enable transport encryption, service-to-service authentication, authorization and monitoring so you can more easily comply with HIPAA requirements and protect patient data.

Aspen Mesh Makes it Easier to Implement and Scale a Secure HIPAA Compliant Microservice Environment
Building a HIPAA compliant microservice architecture at scale is a serious challenge without the right tools. Having to ensure each service adheres to both organizational and regulatory compliance requirements is not an easy task.  

Achieving HIPAA compliance involves addressing a number of technical security requirements such as network protection, encryption and key management, identification and authorization of users and auditing access to systems. These are all distinct development efforts that can be hard to achieve individually, but even harder to achieve as a coordinated team. The good news is, with the help of Aspen Mesh, your engineering team can spend less time building and maintaining non-functional yet essential features, and more time building features that provide direct value to your customers.