Leveraging Service Mesh To Address HIPAA Security Requirements

Building a product on a distributed microservice architecture for the healthcare industry, while following the requirements set forth in the Health Insurance Portability and Accountability Act (HIPAA), is hard. Trust me, I have felt the pain. I’ve spent the majority of my career building, securing and ensuring compliance for products in highly regulated industries, including healthcare. Healthcare is a necessity for all, and it is growing at a rapid pace as new advancements are made. This is great for our health and wellbeing, but it poses new challenges for organizations that process and store sensitive data such as Personally Identifiable Information (PII) and Electronic Protected Health Information (ePHI). What used to be a system of paper charts in manila envelopes stored in filing cabinets is now a large interconnected system where patient medications, X-rays, surgeries, diagnoses and other health-related data are transferred between internal and external entities. This advancement allows physicians to quickly provide other entities with your entire medical history, so you receive the best care possible, as quickly as possible. But this exchange does not come without risk. Anytime you make something more accessible, you introduce new attack surfaces and points of failure, allowing data to be leaked and increasing the possibility of malicious attacks.

The HIPAA Security Rule was created to help address this new risk. It mandates that organizations that process or store ePHI follow certain safeguards to protect sensitive data.

The technical safeguard standards introduced by the Security Rule include:

  • Authentication - verification of the identity of the actor seeking access to protected data.
  • Authorization - verification that the actor is allowed to access the requested protected data.
  • Audit Controls - mechanisms for recording and examining activities pertaining to protected data within the system.
  • Data Integrity - protecting the data from being altered or destroyed in an unauthorized manner.

Implementing these safeguards may seem like an obvious thing to do when processing or storing sensitive data, but all too often they are overlooked or deemed too difficult, expensive and/or time consuming to implement with available resources. No matter the reason, this is a violation in the eyes of the U.S. Department of Health and Human Services (HHS) Office for Civil Rights (OCR) and can result in fines of up to $1.5 million a year for each violation, and even criminal charges. Fortunately, a service mesh helps address many of these standards in a way that requires less effort than building custom controls, and is also less error prone.

Let’s take a look at how you can leverage Aspen Mesh, the enterprise-ready service mesh built on Istio, to easily implement controls to address many of these standards that would otherwise require significant development effort and expertise.

Authentication
As briefly discussed, authentication is the verification of the identity of the actor seeking access to protected data. With Aspen Mesh, you can configure mesh-wide service-to-service authentication and end-user authentication with little effort. In fact, the recommended default Aspen Mesh installation enables mesh-wide mTLS automatically, without requiring any code changes.
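For reference, here is roughly what that looks like if you enable mesh-wide mTLS yourself on stock Istio of the same vintage as the examples below. This is a minimal sketch; with Aspen Mesh’s default install you shouldn’t need to write either resource:

apiVersion: "authentication.istio.io/v1alpha1"
kind: "MeshPolicy"
metadata:
  # The mesh-wide policy must be named "default"
  name: "default"
spec:
  peers:
  - mtls: {}
---
apiVersion: "networking.istio.io/v1alpha3"
kind: "DestinationRule"
metadata:
  name: "default"
  namespace: "istio-system"
spec:
  # Have client sidecars originate mTLS to all in-mesh services
  host: "*.local"
  trafficPolicy:
    tls:
      mode: ISTIO_MUTUAL

The MeshPolicy tells server-side sidecars to require mTLS, and the DestinationRule tells client-side sidecars to originate it.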

Now that you have service-to-service authentication and transport encryption enabled, the next step is to enable end-user authentication.

Below is an example of how you would enable end-user authentication on a Patient Check-in Service using an external Identity Management Service that supports JWTs (e.g. Azure Active Directory B2C, Amazon Cognito, Auth0, Okta, GSuite), so reception personnel can securely log in and check in patients as they arrive.

1. You’re going to need to make note of the JWT issuer and JWKS URI from your User Directory Service.
2. Create and apply a Policy called patients-checkin-user-auth that configures end user authentication to the Patient Check-in Service using your JWT supported Identity Management Service of choice.

apiVersion: "authentication.istio.io/v1alpha1"
kind: "Policy"
metadata:
  name: "patients-checkin-user-auth"
spec:
  targets:
  - name: patient-checkin
  peers:
  - mtls: {}
  origins:
  - jwt:
      issuer: "<REPLACE_WITH_YOUR_JWT_SUPPORTED_IDENTITY_MANAGEMENT_SERVICE_ISSUER>"
      jwksUri: "<REPLACE_WITH_YOUR_JWT_SUPPORTED_IDENTITY_MANAGEMENT_SERVICE_USER_DIRECTORY_JWKS_URI>"
  principalBinding: USE_ORIGIN

3. Ensure that the Patient Check-in frontend application places the JWT in the Authorization header (Authorization: Bearer <token>) of HTTP requests to the backend services
4. That’s it!

Authorization
Aspen Mesh provides flexible and fine-grained Role-Based Access Control (RBAC) via centralized policy management. With policy control, you can easily define which services are allowed to communicate, which methods services can call, rate limit requests, and define and enforce quotas.

Below is a simple example in which the Patient Check-in Service can make GET, PUT, and POST requests to the Patients Service but can’t make DELETE requests, while the Admin Service can make GET, POST, PUT, and DELETE requests to the Patients Service.

1. Create a ServiceRole called patient-service-querier which allows GET, PUT, and POST requests to the Patients Service.

apiVersion: "rbac.istio.io/v1alpha1"
kind: ServiceRole
metadata:
  name: patient-service-querier
  namespace: default
spec:
  rules:
  - services: ["patients.default.svc.cluster.local"]
    methods: ["GET", "PUT", "POST"]

2. Create another ServiceRole called patients-admin that allows GET, POST, PUT, and DELETE requests to the Patients Service.

apiVersion: "rbac.istio.io/v1alpha1"
kind: ServiceRole
metadata:
  name: patients-admin
  namespace: default
spec:
  rules:
  - services: ["patients.default.svc.cluster.local"]
    methods: ["GET", "POST", "PUT", "DELETE"]

3. Create a ServiceRoleBinding called bind-patient-service-querier which assigns the patient-service-querier role to the cluster.local/ns/default/sa/patient-check-in service account, which represents the Patient Check-in Service.

apiVersion: "rbac.istio.io/v1alpha1"
kind: ServiceRoleBinding
metadata:
  name: bind-patient-service-querier
  namespace: default
spec:
  subjects:
  - user: "cluster.local/ns/default/sa/patient-check-in"
  roleRef:
    kind: ServiceRole
    name: "patient-querier"

4. Lastly, we’ll create another ServiceRoleBinding called bind-patient-service-admin which assigns the patients-admin role to the cluster.local/ns/default/sa/admin service account, which represents the Admin Service.

apiVersion: "rbac.istio.io/v1alpha1"
kind: ServiceRoleBinding
metadata:
  name: bind-patient-service-admin
  namespace: default
spec:
  subjects:
  - user: "cluster.local/ns/default/sa/admin"
  roleRef:
    kind: ServiceRole
    name: "patient-admin"

As you can see, you can quickly and effectively add Authorization between services in your mesh without any custom development work.
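One caveat worth calling out: ServiceRole and ServiceRoleBinding resources only take effect once Istio RBAC is enabled for the namespaces you care about. As a rough sketch, assuming an Istio version that uses the ClusterRbacConfig resource (older releases used RbacConfig instead):

apiVersion: "rbac.istio.io/v1alpha1"
kind: ClusterRbacConfig
metadata:
  # Only one instance is honored, and it must be named "default"
  name: default
spec:
  mode: ON_WITH_INCLUSION
  inclusion:
    # Enforce RBAC only for services in the default namespace
    namespaces: ["default"]

With this in place, any request in the default namespace that doesn’t match a ServiceRoleBinding is denied.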

Audit Controls
Keeping audit records of data access is one of the key requirements for HIPAA compliance, as well as a security best practice. With Aspen Mesh, you get a single source of truth with in-depth tracing between all services within the mesh. Traces can be accessed and exported via the ‘Tracing’ tab on the Aspen Mesh Dashboard or API. You may still need to add the corresponding audit logs for specific actions that happen within a service to comply with all of the requirements, but at least you have reduced the amount of engineering effort spent on non-functional but essential tasks.

Data Integrity
Data integrity and confidentiality are arguably among the most critical requirements of HIPAA. If sensitive data such as medications, blood type or allergies are modified by or leaked to an unauthorized user, it could be detrimental to the patient. With Aspen Mesh you can quickly and easily enable transport encryption, service-to-service authentication, authorization and monitoring so you can more easily comply with HIPAA requirements and protect patient data.

Aspen Mesh Makes it Easier to Implement and Scale a Secure HIPAA Compliant Microservice Environment
Building a HIPAA compliant microservice architecture at scale is a serious challenge without the right tools. Having to ensure each service adheres to both organizational and regulatory compliance requirements is not an easy task.  

Achieving HIPAA compliance involves addressing a number of technical security requirements such as network protection, encryption and key management, identification and authorization of users and auditing access to systems. These are all distinct development efforts that can be hard to achieve individually, but even harder to achieve as a coordinated team. The good news is, with the help of Aspen Mesh, your engineering team can spend less time building and maintaining non-functional yet essential features, and more time building features that provide direct value to your customers.



Aspen Mesh Open Beta Makes Istio Enterprise-ready

As companies build modern applications, they are leveraging microservices to effectively build and manage them. As they do, they realize that they are increasing complexity and de-centralizing ownership and control. These new challenges require a new way to monitor, manage and control microservice-based applications at runtime.

A service mesh is an emerging pattern that is helping ensure resiliency and uptime - a way to more effectively monitor, control and secure the modern application at runtime. Companies are adopting Istio as their service mesh of choice as it provides a toolbox of different features that address various microservices challenges. Istio provides solutions to many of these challenges, but leaves some critical enterprise challenges on the table. Enterprises require additional features that address observability, policy and security. With this in mind, we have built new enterprise features into a platform that runs on top of Istio, to provide all the functionality and flexibility of open source, plus features, support and guarantees needed to power enterprise applications.

At KubeCon North America 2018, Aspen Mesh announced open beta. With Aspen Mesh you get all the features of Istio, plus:

Advanced Policy Features
Aspen Mesh provides RBAC capabilities you don’t get with Istio.

Configuration Vets
Istio Vet (an Aspen Mesh open source contribution that helps ensure correct configuration of your mesh) is built into Aspen Mesh, along with additional vetting features that don’t come with open source Istio Vet.

Analytics and Alerting
The Aspen Mesh platform provides insights into key metrics (latency, error rates, mTLS status) and immediate alerts so you can take action to minimize MTTD/MTTR.

Multi-cluster/Multi-cloud
See what’s going on across multiple clusters that live in different clouds, all through a single pane of glass for your microservice architecture.

Canary Deploys
Aspen Mesh Experiments lets you quickly test new versions of microservices so you can qualify new versions in a production environment without disrupting users.

An Intuitive UI
Get at-a-glance views of performance and security posture as well as the ability to see service details.

Full Support
Our team of Istio experts makes it easy to get exactly what you need out of service mesh. 

You can take advantage of these features for free by signing up for Aspen Mesh Beta access.



How The Service Mesh Space Is Like Preschool

I have a four-year-old son who recently started attending full-day preschool. It has been fascinating to watch his interests shift from playing with stuffed animals and pushing a corn popper to playing with his science set (w00t for the STEM lab!) and riding his bike. The other kids in school are definitely informing his view of what cool new toys he needs. Undoubtedly, he could still make do with the popper and stuffed animals (he may sleep with Lambie until he's ten), but as he progresses his desire to explore new things increases.

Watching the community around service mesh develop is similar to watching my son's experience in preschool (if you're willing to make the stretch with me). People have come together in a new space to learn about cool new things, and as excited as they are, they don't completely understand the cool new things. Just as in preschool, there are a ton of bright minds that are eager to soak up new knowledge and figure out how to put it to good use.

Another parallel between my son and many of the people we talk to in the service mesh space is that they both have a long and broad list of questions. In the case of my son, it's awesome because they're questions like: "Is there a G in my name?" "What comes after Sunday?" "Does God live in the sky with the unicorns?" The questions we get from prospects and clients on service mesh are a bit different but equally interesting. It would take more time than anybody wants to spend to cover all these questions, but I thought it might be interesting to cover the top 3 questions we get from users evaluating service mesh.

What do I get with a service mesh?

We like getting this question because the answer to it is a good one. You get a toolbox that gives you a myriad of different capabilities. At a high level, what you get is observability, control and security of your microservice architecture. The features that a service mesh provides include:

  • Load balancing
  • Service discovery
  • Ingress and egress control
  • Distributed tracing
  • Metrics collection and visualization
  • Policy and configuration enforcement
  • Traffic routing
  • Security through mTLS

When do I need a service mesh?

You don't need 1,000 microservices for a service mesh to make sense. If you have nicknames for your monoliths, you're probably a ways away from needing a service mesh. And you probably don't need one if you only have 2 services, but if you have a few services and plan to continue down the microservices path, it is easier to get started sooner. We are believers that containers and Kubernetes will be the way companies build infrastructure in the future, and waiting to hop on that train will only be a competitive disadvantage. Generally, we find that the answer to this question usually hinges on whether or not you are committed to cloud native. Service meshes like Aspen Mesh work seamlessly with cloud native tools, so the barrier to entry is low, and running cloud native applications will be much easier with the help of a service mesh.

What existing tools does service mesh allow me to replace?

This answer all depends on what functionality you want. Here's a look at the tools a service mesh overlaps with, what it provides, and where you'll need to keep your old tools.

API gateway
Not yet. It replaces some of the functionality of an API gateway but does not yet cover all of the ingress and payment features an API gateway provides. Chances are API gateways and service meshes will converge in the future.

Tracing Tools
You get tracing capabilities as part of Istio. If you are using distributed tracing tools such as Jaeger or Zipkin, you no longer need to manage them separately as they are part of the Istio toolbox. With Aspen Mesh's hosted SaaS platform, we offer managed Jaeger so you don't even need to deploy or manage it.

Metrics Tools
Just like tracing, a metrics monitoring tool is included as part of Istio. Istio leverages Prometheus to query metrics, and you have the option of visualizing them through the Prometheus UI or using Grafana dashboards. With Aspen Mesh's hosted SaaS platform, we offer managed Prometheus and Grafana so you don't even need to deploy or manage them.

Load Balancing
Yep. Envoy is the sidecar proxy used by Istio and provides load balancing functionality such as automatic retries, circuit breaking, global rate limiting, request shadowing and zone-local load balancing. You can use a service mesh in place of tools like HAProxy or NGINX for ingress load balancing.

Security tools
Istio provides mTLS capabilities that address some important microservices security concerns. If you’re using SPIRE, you can replace it with Istio, which provides a more comprehensive utilization of the SPIFFE framework. An important thing to note is that while a service mesh adds several important security features, it is not the end-all-be-all for microservices security. It’s important to also consider a strategy around network security.

If you have little ones and would be interested in comparing notes on the fantastic questions they ask, let’s chat. I'd also love to talk anything service mesh. We have been helping a broad range of customers get started with Aspen Mesh and make the most out of it for their use case. We’d be happy to talk about any of those experiences and best practices to help you get started on your service mesh journey. Leave a comment here or hit me up @zjory.



From Middleware to Containers: Infrastructure is Finally Cool

As someone fresh out of school just starting my software engineering career, I want to solve interesting problems. Who doesn’t? A computer science degree gave me the opportunity to see a spectrum of different engineering opportunities, which led me to decide that working on infrastructure would be the most impactful area, and with the rise of cloud native technologies, actually a compelling space to work in. There is a difference between developing new functionality and developing to solve existing problems. More often than not, the solutions that address existing challenges in an industry are the ones that are used the most and last the longest. This is what excites me about working on infrastructure: the ability to build something that millions of people will rely on to run their applications. On the surface it doesn’t appear to be the most exciting work, but you can be sure that your time and effort is being put to good use.

You want to see your contributions make an impact somehow, whether that’s writing webapps, iPhone applications, business tools, etc. - the things that people actually use day-to-day. Infrastructure may not be as visible or as tangible as these kinds of technologies, but it’s gratifying to know that it’s the underlying piece that makes it all work. As much as I want to be able to say that I contribute to something that all of my non-tech friends can easily understand (like the front-end of Netflix), I think it’s even more interesting to make them think about the things that happen behind the scenes. We all expect our favorite apps, websites, etc. to be able to respond quickly to our requests no matter how many people are using them at the same time, but on the backend this is not something that is easy to handle and properly test for. What about security? We also expect that when we are trusting software with our information that it isn’t being easily intercepted or leaked along the way. Scalability and security are just two of many kinds of problems that software infrastructure incorporates, and in the end we are relying on them to actually make the front-end software usable. The advantage these days is that infrastructure software has become an incredibly interesting space to be in. Tools like Docker, Kubernetes and Istio are fascinating technologies with vibrant communities around them.

One of the cool, heavily used Kubernetes-related projects that I’m a fan of is Envoy. I can’t help but think about how some version of Envoy is being used every time I order a Lyft to make sure I actually get a ride. Infrastructure doesn’t seem as intriguing at first because, as important as it is, it’s running in the background and easily forgotten. Everyone needs it, but in the end, who wants to build it? The answer to that question is definitely changing as the infrastructure landscape evolves. Kubernetes, the OS of the cloud, has become a project that everyone wants a hand in. You don’t hear about people itching to make contributions to the Linux kernel, but you hear about Kubernetes and containers everywhere.

Coming up with solutions to solve the problems that we’re running into today has become more attractive to junior developers especially. We’re watching as more and more people are using technology every day, and like I mentioned before, we want our contributions to be impactful. How are we going to handle all of this traffic in a smooth and scalable way? Enter: distributed systems. Microservices are critical to constructing applications that can handle huge transaction volumes at scale. Enterprise applications run by companies like Lyft, Twitter and Google would fall apart with even normal rates of traffic without their distributed architectures. Working on these infrastructural pieces is challenging, and provides the impact that we, junior developers, are looking for.

Another thing that makes this work enticing to junior developers is that it involves an open source community. The way that the tech community has decided to solve some of these bigger, infrastructure-related problems has largely been through open source, which is both intimidating and inviting to those who are new to the tech industry. There is an open group of people talking about the technology and a community willing to help, but at the same time it’s daunting to contribute to these bigger open source projects when you’re just starting out. I will say, however, that the benefits of being able to leverage so many technologies and the community support make it a lot of fun to be a part of.

To recap, here are some of my favorite things about working on infrastructure:

  • We can solve some really hard problems with good infrastructure!
  • If it’s done right, you can build something that can be easily customized to solve problems of various sizes and for all kinds of use cases.
  • All of the cool things and services we consume daily rely on it. Talk about actually seeing your hard work being put to good use!
  • Whether you’re doing proprietary work or not, you are being introduced to open source and the community that comes with it.

I’ll admit, developing infrastructure, despite all of the interesting bits, is still not the most glamorous work. It’s the underlying technology that most people take for granted in their everyday use of technology, and is often less shiny than a beautifully designed UI and other components that sit on top of it. But once you dig in, it’s exciting to see what an impact you can make with it and cloud-native technologies and communities make it a fun space to work in. What I will say though is that it’s a great way to start out your career in tech, and it’s a fun, challenging, and very rewarding place to be.



Why You Should Care About Istio Gateways

If you’re breaking apart the monolith, one of the huge advantages to using Istio to manage your microservices is that it enables a configuration that leverages an ingress model that is similar to traditional load balancers and application delivery controllers. In the load balancers world, Virtual IPs and Virtual Servers have long been used as concepts that enable operators to configure ingress traffic in a flexible and scalable manner (Lori Macvittie has some relevant thoughts on this).

In Istio, Gateways control the exposure of services at the edge of the mesh. Gateways allow operators to specify L4-L6 settings like port and TLS settings. For L7 settings of the ingress traffic, Istio allows you to tie Gateways to VirtualServices. This separation makes it easy to manage traffic flow into the mesh in much the same way you would tie Virtual IPs to Virtual Servers in traditional load balancers. This enables users of legacy technologies to migrate to microservices in a seamless way. It’s a natural progression for teams that are used to monoliths and edge load balancers rather than an entirely new way to think about networking configuration.
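To make that concrete, here is a minimal sketch of a Gateway tied to a VirtualService; the hostname, certificate paths, service name and port below are placeholders for illustration:

apiVersion: networking.istio.io/v1alpha3
kind: Gateway
metadata:
  name: portal-gateway
spec:
  selector:
    istio: ingressgateway   # use Istio's default ingress gateway workload
  servers:
  - port:
      number: 443
      name: https
      protocol: HTTPS
    tls:
      mode: SIMPLE          # terminate TLS at the edge (an L4-L6 concern)
      serverCertificate: /etc/istio/ingressgateway-certs/tls.crt
      privateKey: /etc/istio/ingressgateway-certs/tls.key
    hosts:
    - "portal.example.com"
---
apiVersion: networking.istio.io/v1alpha3
kind: VirtualService
metadata:
  name: portal-routes
spec:
  hosts:
  - "portal.example.com"
  gateways:
  - portal-gateway          # bind these L7 routes to the Gateway above
  http:
  - match:
    - uri:
        prefix: /api
    route:
    - destination:
        host: portal-backend
        port:
          number: 8080

The Gateway owns the L4-L6 exposure (ports, protocol, TLS), while the VirtualService bound to it owns the L7 routing, much like the Virtual IP / Virtual Server split.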

One important thing to note is that routing traffic within a service mesh and getting external traffic into the mesh are a bit different. Within the mesh, you specify exceptions from normal traffic, since Istio by default (for compatibility with Kubernetes) allows everything to talk to everything once inside the mesh. It’s necessary to add a policy if you don’t want certain services communicating. Getting traffic into the mesh is a reverse proxy situation (similar to traditional load balancers) where you have to specify exactly what you want allowed in.

Earlier versions of Istio leveraged the Kubernetes Ingress resource, but the Istio v1alpha3 APIs leverage Gateways for richer functionality, as Kubernetes Ingress has proven insufficient for Istio applications. The Kubernetes Ingress API merges the specification for L4-L6 and L7, which makes it difficult for different teams in organizations with separate trust domains (like SecOps and NetOps) to own Ingress traffic management. Additionally, the Ingress API is less expressive than the routing capability that Istio provides with Envoy. The only way to do advanced routing in the Kubernetes Ingress API is to add annotations for different ingress controllers. Separate concerns and trust domains within an organization warrant the need for a more capable way to manage ingress, which is provided by Istio Gateways and VirtualServices.

Once traffic is in the mesh, it’s good to be able to provide a separation of concerns for VirtualServices so different teams can manage traffic routing for their services. L4-L6 specifications are generally something that SecOps or NetOps might be concerned with. L7 specifications are of most concern to cluster operators or application owners. So it’s critical to have the right separation of concerns. As we believe in the power of team-based responsibilities, we think this is an important capability. And since we believe in the power of Istio, we are submitting an RFE in the Istio community which we hope will help enable ownership semantics for traffic management within the mesh.

We’re excited Istio has hit 1.0 and are excited to keep contributing to the project and community as it continues to evolve.

Originally published at https://thenewstack.io/why-you-should-care-about-istio-gateways/

Distributed Tracing, Istio and Your Applications

In the microservices world, distributed tracing is slowly becoming the most important tool for debugging and understanding your application dependencies. During my recent conversations at meetups and conferences, I found there was a lot of interest in how distributed tracing works, but at the same time a fair amount of confusion about how tracing interacts with service meshes like Istio and Aspen Mesh. In particular, I was asked the following questions frequently:

  • How does tracing work with Istio? What information is collected and reported in the spans?
  • Do I have to change my applications to benefit from distributed tracing in Istio?
  • If I am currently reporting spans in my application how will it interact with spans reported from Istio?

In this blog I am going to try and answer these questions. Before we get deeper into them, a quick bit of background on why and how I ended up writing tracing-related blogs. If you follow the Aspen Mesh blog, you may have noticed I wrote two blogs related to tracing: one on tracing requests to AWS services when using Istio, and the second on tracing gRPC applications with Istio.

We have a pretty small engineering team at Aspen Mesh, and as it goes in most startups, if you work frequently on a sub-system or component you quickly become (or are labeled, or assigned as) the resident expert. I added tracing in our microservices and integrated it with Istio in the AWS environment, and in that process uncovered various interesting interactions which I thought might be worth sharing. Over the last few months we have been using tracing very heavily to gain understanding of our microservices, and it has now become the first place we look when things break. With that, let's move on to answering the questions I mentioned above.

How does tracing work with Istio?

Istio injects a sidecar proxy (Envoy) in the Pod in which your application container is running. This sidecar proxy transparently intercepts (iptables magic) all network traffic going in and out of your application. Because of this interception the sidecar proxy is in a unique position to automatically trace all network requests (HTTP/1.1, HTTP/2.0 & gRPC).

Let's see what changes the sidecar proxy makes to an incoming request to a Pod from a client (external or another microservice). From this point on I'm going to assume tracing headers are in Zipkin format for simplicity.

  • If the incoming request doesn't have any tracing headers, the sidecar proxy will create a root span (span where trace, parent and span IDs are all the same) before passing the request to the application container in the same Pod.
  • If the incoming request has tracing information (which should be the case if you're using Istio ingress or your microservice is being called from another microservice with sidecar proxy injected), the sidecar proxy will extract the span context from these headers, create a new sibling span (same trace, span and parent ID as incoming headers) before passing the request to the application container in the same Pod.

In the reverse direction, when the application container is making outbound requests (to external services or services in the cluster), the sidecar proxy in the Pod performs the following actions before making the request to the upstream service:

  • If no tracing headers are present, the sidecar proxy creates root span and injects the span context as tracing headers into the new request.
  • If tracing headers are present, the sidecar proxy extracts the span context from the headers and creates a child span from this context. The new context is propagated as tracing headers in the request to the upstream service.

Based on the above explanation you should note that for every hop in your microservice chain you will get two spans reported from Istio, one from the client sidecar (span.kind set to client) and one from the server sidecar (span.kind set to server). All the spans created by the sidecars are automatically reported to the configured tracing backend systems like Jaeger or Zipkin.

Next let's look at the information reported in the spans. The spans contain the following information:

  • x-request-id: Reported as guid:x-request-id which is very useful in correlating access logs with spans.
  • upstream cluster: The upstream service to which the request is being made. If the span is tracking an incoming request to a Pod this is typically set to in.<name>. If the span is tracking an outbound request this is set to out.<name>.
  • HTTP headers: The following HTTP headers are reported when available:
    • URL
    • Method
    • User agent
    • Protocol
    • Request size
    • Response size
    • Response Flags
  • Start and end times for each span.
  • Tracing metadata: This includes the trace ID, span ID and the span kind (client or server). Apart from these the operation name is also reported for every span. The operation name is set to the configured virtual service (or route rule in v1alpha1) which affected the route or "default-route" if the default route was chosen. This is very useful in understanding which Istio route configuration is in effect for a span.

With that let's move on to the second question.

Do I have to change my application to gain benefit from tracing in Istio?

Yes, you will need to add logic in your application to propagate tracing headers from incoming to outgoing requests to gain full benefit from Istio's distributed tracing.

If the application container makes a new outbound request in the context of an incoming request and doesn't propagate the tracing headers from the incoming request, the sidecar proxy creates a root span for the outbound request. This means you will always see traces with only two microservices. On the other hand if the application container does propagate the tracing headers from incoming to outgoing requests, the sidecar proxy will create child spans as described above. Creation of the child spans gives you the ability to understand dependencies across multiple microservices.

There are a couple of options for propagating tracing headers in your application.

  1. Look for tracing headers as mentioned in the Istio docs and transfer the headers from incoming to outgoing requests. This method is simple and works in almost all cases. However, it has a major drawback: you cannot add custom tags to the spans, like user information, and you cannot create child spans related to events in the application which you might want to report. As you are simply transferring headers without understanding the span formats or contexts, there is limited ability to add application specific information.
  2. The second method is setting up a tracing client in your application and using the OpenTracing APIs to propagate tracing headers from incoming to outgoing requests. I have created a sample tracing-go package which provides an easy way to set up jaeger-client-go in your applications in a way that is compatible with Istio. The following snippet should be wired into the main function of your application (defer closing the returned tracer so that buffered spans are flushed on shutdown):
       import (
         "io"
         "log"

         "github.com/spf13/viper"

         "github.com/aspenmesh/tracing-go"
       )

       // setupTracing configures tracing from viper flags and sets the
       // OpenTracing global tracer. It returns a closer (nil if tracing is
       // disabled) that main should Close on shutdown to flush buffered spans.
       func setupTracing() io.Closer {
         // Configure Tracing
         tOpts := &tracing.Options{
           ZipkinURL:     viper.GetString("trace_zipkin_url"),
           JaegerURL:     viper.GetString("trace_jaeger_url"),
           LogTraceSpans: viper.GetBool("trace_log_spans"),
         }
         if err := tOpts.Validate(); err != nil {
           log.Fatal("Invalid options for tracing: ", err)
         }
         var tracer io.Closer
         if tOpts.TracingEnabled() {
           var err error
           tracer, err = tracing.Configure("myapp", tOpts)
           if err != nil {
             log.Fatal("Failed to configure tracing: ", err)
           }
         }
         return tracer
       }

The key point to note is that in the tracing-go package I have set the OpenTracing global tracer to the Jaeger tracer. This enables me to use the OpenTracing APIs for propagating headers from incoming to outgoing requests like this:

    import (
      "net/http"

      "golang.org/x/net/context/ctxhttp"

      ot "github.com/opentracing/opentracing-go"
    )

    func injectTracingHeaders(incomingReq *http.Request, addr string) {
      if span := ot.SpanFromContext(incomingReq.Context()); span != nil {
        outgoingReq, _ := http.NewRequest("GET", addr, nil)
        // Inject the span context into the outgoing request's headers so the
        // upstream service (and its sidecar) continue the same trace.
        ot.GlobalTracer().Inject(
          span.Context(),
          ot.HTTPHeaders,
          ot.HTTPHeadersCarrier(outgoingReq.Header))

        resp, err := ctxhttp.Do(incomingReq.Context(), nil, outgoingReq)
        if err != nil {
          return
        }
        defer resp.Body.Close()
        // Do something with resp
      }
    }

You can also use the OpenTracing APIs to set span tags or create child spans from the tracing context added by Istio like this:

   func SetSpanTag(incomingReq *http.Request, key string, value interface{}) {
     if span := ot.SpanFromContext(incomingReq.Context()); span != nil {
       span.SetTag(key, value)
     }
   }

Apart from these benefits, you don't have to deal with tracing headers directly; the tracer (in this case Jaeger) handles them for you. I strongly recommend using this approach as it sets the foundation in your application to add enhanced tracing capabilities without much overhead.

Now let's move on to the third question.

How do spans reported by Istio interact with spans created by applications?

If you want the spans reported by your application to be child spans of the tracing context added by Istio, you should use the OpenTracing API StartSpanFromContext instead of StartSpan. StartSpanFromContext creates a child span from the parent context if one is present, and otherwise creates a root span.
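For example, here is a minimal sketch using the same ot alias as the earlier snippets, assuming the span context added by Istio has already been placed on the request's context (the handler name and tag are hypothetical):

    func checkInPatient(incomingReq *http.Request) {
      // Child span of the Istio-added span context if present, else a root span
      span, ctx := ot.StartSpanFromContext(incomingReq.Context(), "check-in-patient")
      defer span.Finish()

      span.SetTag("example.tag", "example-value")

      // Pass ctx to downstream calls so their spans join the same trace
      _ = ctx
    }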

Note that in all the examples above I have used the OpenTracing Go APIs, but you should be able to use any tracing client library written in the same language as your application as long as it is OpenTracing API compatible.


Istio 0.8 Updates

It’s here! Istio 0.8 landed on May 31, 2018. It brought a slew of new features, stability and performance improvements, and new APIs. I thought it would be helpful to share some of our favorites and experience with them.

Pilot Scalability

The first big change we are excited about in 0.8 is related to Pilot scalability. Before 0.8, whenever we used Istio in clusters with more than a dozen services and more than 40-50 pods we started seeing catastrophically bad Pilot performance. The earliest bug we were aware of around this was #2728, but there have been a bunch of different manifestations. The root cause of the pain is computing the new Pilot configuration when something in the cluster changes (like a new service, or a pod going away), combined with the fact that every sidecar needed to poll Pilot for changes. The new v1alpha3 config APIs were added to Pilot using the new Envoy v2 config APIs, which use a push model (documented here and here). And that work has paid off!

We were really happy when we deployed 0.8 in our test cluster and scaled services up and down, nuked and paved large apps, and did everything we could think of. If you previously tried Istio and got a bit of a queasy feeling around behaviors when you grew your service mesh beyond an example app, now might be the time to give it another shot. (Istio 1.0 isn’t too far off, either.)

Sidecar Injection

Istio is built on the sidecar model - every pod in the mesh has a dataplane proxy running right alongside it to add the service mesh smarts. All traffic to and from the application container goes through the sidecar first. The application container should be unaware - the less the application has to be coupled to the sidecar, the better.

But something has to make sure the sidecar (and its friend, the init container) get started alongside the app. Previous versions of Istio let you inject manually or at the time that the Kubernetes Deployment was created. The drawback of Deployment-time injection comes when you want to upgrade your service mesh to run a newer version of the sidecar.  You either have to patch the Deployments in your Kubernetes cluster (“manually” injecting the upgrade), or delete and recreate them.

With 0.8, we can use a MutatingWebhook Pod Admission Controller - although the title is a mouthful, the end result is a pretty neat example of Kubernetes extensibility. Whenever Kubernetes is about to create a pod, it lets its pod admission controllers take a look and decide to allow, reject or allow-with-changes that pod. Every (properly authenticated and configured) admission controller gets a shot.  Istio provides one of these admission controllers that injects the current sidecar and init container. (It’s called a MutatingWebhook because it can mutate the pod to add the sidecar instead of only accept/reject, and it runs in some other service reachable via webhook aside from the Kubernetes API server.)

We like this because it makes it easier for us to upgrade our service mesh. We’re happiest if all of our pods are running the most current version of the sidecar and that version correlates to the release of the Istio control plane version we’ve deployed.  We don’t like to think about different pods using different (older) sidecars if we don’t have to. Now, once we install an upgraded control plane, we trigger some sort of rolling upgrade on the pods behind each service and they pick up the new sidecar.

There are multiple ways to say “Hey Istio, please inject a sidecar into this” or “Hey Istio, please leave this alone”. We use namespace annotations to turn it on and then pod metadata annotations to disable it for particular pods if needed.
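For reference, in upstream Istio the namespace-level switch is expressed as a label and the per-pod opt-out as an annotation; here's a minimal sketch with placeholder names:

apiVersion: v1
kind: Namespace
metadata:
  name: my-app
  labels:
    istio-injection: enabled           # webhook injects the sidecar into new pods here
---
apiVersion: v1
kind: Pod
metadata:
  name: legacy-batch
  namespace: my-app
  annotations:
    sidecar.istio.io/inject: "false"   # opt this particular pod out of injection
spec:
  containers:
  - name: batch
    image: example/batch:1.0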

v1alpha3 APIs and composability

Finally, a bit about new APIs. Istio 0.8 introduces a bunch of new Kubernetes resources for configuration. Before, we had what are called “v1alpha1” resources like RouteRules. Now, we have “v1alpha3” resources like DestinationPolicies and VirtualServices. 0.8 supports both v1alpha1 and v1alpha3 resources as a migration point from v1alpha1 to v1alpha3. The future is v1alpha3 config resources so we should all be porting any of our existing config.

We found this process to be not too difficult for our existing configs. Minor note - we all think of incoming traffic first but don’t forget to also port Egress config to ServiceEntries or opt-out IP ranges. Personally, even though it can be a bit surprising at first to have to configure external access, we really like the control and visibility we get from managing Egresses with ServiceEntries for production environments.

One of the big changes in the new v1alpha3 APIs is that you define the routing config for a service all in one place like the VirtualService custom resource.  Before, you could define this config in multiple RouteRules that were ordered by precedence. A downside of the multiple RouteRules approach is that if you want to model what’s going to happen in your head, you’ve got to read and sort different RouteRule resources. The new approach is definitely simpler, but the tradeoff for that simplicity is some difficulty if you actually want your config to come from multiple resources.
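As an illustration of the all-in-one-place model, every route for a host now lives in a single VirtualService (the service and subset names below are just placeholders):

apiVersion: networking.istio.io/v1alpha3
kind: VirtualService
metadata:
  name: reviews
spec:
  hosts:
  - reviews
  http:
  - match:                     # what used to be a high-precedence RouteRule
    - headers:
        end-user:
          exact: qa-tester
    route:
    - destination:
        host: reviews
        subset: v2
  - route:                     # what used to be a separate catch-all RouteRule
    - destination:
        host: reviews
        subset: v1

Everything that used to be spread across precedence-ordered RouteRules for this host now has to live in this one resource, which is what makes contributions from multiple teams awkward.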

We think this kind of thing can happen when you have multiple teams contributing to a larger service mesh and the apps on it (think a Platform Ops team contributing “base” policy, a Platform Security team contributing some policy on top, and finally an individual app team fine-tuning their particular routes). We’re not sure what the solution is, but my colleague Neeraj started the conversation on the istio-dev list and we’re looking forward to seeing where it goes in the future.

Better external service support

It used to be the case that if you wanted a service in the mesh to communicate with a TLS service outside the mesh, you had to modify your service to speak HTTP over port 443 so that Istio could route it correctly. Now that Istio can use SNI to route traffic, you can leave your service alone and configure Istio to allow it to communicate with that external service by hostname. Since this is TLS passthrough, you don’t get L7 visibility of the egress traffic, but since you don’t need to modify your service, it allows you to add services to the mesh that you might not have been able to before.
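A rough sketch of what that looks like, using a hypothetical external TLS API; the ServiceEntry below lets sidecars route the traffic by SNI and pass the TLS through untouched:

apiVersion: networking.istio.io/v1alpha3
kind: ServiceEntry
metadata:
  name: external-payments-api
spec:
  hosts:
  - payments.example.com        # routed by SNI, traffic stays encrypted end to end
  location: MESH_EXTERNAL
  ports:
  - number: 443
    name: https
    protocol: HTTPS             # passthrough: no L7 visibility, no app changes
  resolution: DNS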

We see great potential with the added features and increased stability in 0.8. We’re looking forward to seeing what is in store with 1.0 and are excited to see how teams use Istio now that it seems ready for production deployments.


Tracing and Metrics: Getting the Most Out of Istio

Are you considering or using a service mesh to help manage your microservices infrastructure? If so, here are some basics on how a service mesh can help, the different architectural options, and tips and tricks on using some key CNCF tools that integrate well with Istio to get the most out of it.

The beauty of a service mesh is that it bundles so many capabilities together, freeing engineering teams from having to spend inordinate amounts of time managing microservices architectures. Kubernetes has solved many build and deploy challenges, but it is still time consuming and difficult to ensure reliability and security at runtime. A service mesh handles the difficult, error-prone parts of cross-service communication such as latency-aware load balancing, connection pooling, service-to-service encryption, instrumentation, and request-level routing.

Once you have decided a service mesh makes sense to help manage your microservices, the next step is deciding what service mesh to use. There are several architectural options, from the earliest model of a library approach, the node agent architecture, and the model which seems to be gaining the most traction – the sidecar model. We have also seen an evolution from data plane proxies like Envoy, to service meshes such as Istio which provide distributed control and data planes. We're active users of Istio, and believers in the sidecar architecture striking the right balance between a robust set of features and a lightweight footprint, so let’s take a look at how to get the most out of tracing and metrics with Istio.

Tracing

One of the capabilities Istio provides is distributed tracing. Tracing provides service dependency analysis for different microservices and tracks requests as they flow through multiple microservices. It’s also a great way to identify performance bottlenecks and zoom into a particular request to determine things like which microservice contributed to the latency of a request or which service created an error.

We use and recommend Jaeger for tracing as it has several advantages:

  • OpenTracing compatible API
  • Flexible & scalable architecture
  • Multiple storage backends
  • Advanced sampling
  • Accepts Zipkin spans
  • Great UI
  • CNCF project and active open source community

Metrics

Another powerful thing you gain with Istio is the ability to collect metrics. Metrics are key to understanding historically what has happened in your applications, and when they were healthy compared to when they were not. A service mesh can gather telemetry data from across the mesh and produce consistent metrics for every hop. This makes it easier to quickly solve problems and build more resilient applications in the future.

We use and recommend Prometheus for gathering metrics for several reasons:

  • Pull model
  • Flexible query API
  • Efficient storage
  • Easy integration with Grafana
  • CNCF project and active open source community

We also use Cortex, which is a powerful tool to enhance Prometheus. Cortex provides:

  • Long term durable storage
  • Scalable Prometheus query API
  • Multi-tenancy

Check out this webinar for a deeper look into what you can do with these tools and more.


Service Mesh Security: Addressing Attack Vectors with Istio

As you break apart your monolith into microservices, you'll gain a slew of advantages such as scalability, increased uptime and better fault isolation. A downside of breaking applications apart into smaller services is that there is a greater surface area for attack. Additionally, all the communication that used to take place via function calls within the monolith is now exposed to the network. Adding security that addresses this must be a core consideration on your microservices journey.

One of the key benefits of Istio, the open source service mesh that Aspen Mesh is built on, is that it provides unique service mesh security and policy enforcement to microservices. An important thing to note is that while a service mesh adds several important security features, it is not the end-all-be-all for microservices security. It’s important to also consider a strategy around network security (a good read on how the network can help manage microservices), which can detect and neutralize attacks on the service mesh infrastructure itself, to ensure you’re entirely protected against today’s threats.

So let’s look at the attack vectors that Istio addresses, which include traffic control at the edge, traffic encryption within the mesh and layer-7 policy control.

Security at the Edge
Istio adds a layer of security that allows you to monitor and address compromising traffic as it enters the mesh. Istio integrates with Kubernetes as an ingress controller and takes care of load balancing for ingress. This allows you to add a level of security at the perimeter with ingress rules. You can apply monitoring around what is coming into the mesh and use route rules to manage compromising traffic at the edge.

To ensure that only authorized users are allowed in, Istio’s Role-Based Access Control (RBAC) provides flexible, customizable control of access at the namespace, service and method level for services in the mesh. RBAC provides two distinct capabilities: the RBAC engine watches for changes to RBAC policy and fetches the updated policy when it sees any, and it authorizes requests at runtime by evaluating the request context against the RBAC policies and returning the authorization result.

Encrypting Traffic
Security at the edge is a good start, but if a malicious actor gets through, Istio provides defense with mutual TLS encryption of the traffic between your services. The mesh can automatically encrypt and decrypt requests and responses, removing that burden from the application developer. It can also improve performance by prioritizing the reuse of existing, persistent connections, reducing the need for the computationally expensive creation of new ones.

Istio provides more than just client-server authentication and authorization; it allows you to understand and enforce how your services are communicating and prove it cryptographically. It automates the delivery of certificates and keys to the services, the proxies use them to encrypt the traffic (providing mutual TLS), and certificates are periodically rotated to reduce exposure to compromise. You can use TLS to ensure that Istio instances can verify they’re talking to other Istio instances, preventing man-in-the-middle attacks.

Istio makes TLS easy with Citadel, the Istio Auth controller for key management. It allows you to secure traffic over the wire and also enables strong identity-based authentication and authorization for each microservice.

Policy Control and Enforcement
Istio gives you the ability to enforce policy at the application level with layer-7 control. Applying policy at this level is ideal for service routing, retries, circuit breaking, and for security that operates at the application layer, such as token validation. Istio provides the ability to set up whitelists and blacklists so you can let in what you know is safe and keep out what you know isn’t.

Istio’s Mixer enables integrating extensions into the system and lets you declare policy constraints on network or service behavior in a standardized expression language. The benefit is that you can funnel all of those things through a common API, which enables you to cache policy decisions at the edge of the service so that, if downstream policy systems start to fail, the network stays up.
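As a rough sketch of how a Mixer-based whitelist is wired together (following the pattern from the Istio denials and white/black listing docs; the names and label values here are purely illustrative):

apiVersion: config.istio.io/v1alpha2
kind: listchecker
metadata:
  name: approved-callers
spec:
  overrides: ["frontend", "admin"]   # static list of allowed values
  blacklist: false                   # treat the list as a whitelist
---
apiVersion: config.istio.io/v1alpha2
kind: listentry
metadata:
  name: caller-app
spec:
  value: source.labels["app"]        # the attribute checked against the list
---
apiVersion: config.istio.io/v1alpha2
kind: rule
metadata:
  name: check-caller
spec:
  match: destination.labels["app"] == "backend"
  actions:
  - handler: approved-callers.listchecker
    instances:
    - caller-app.listentry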

Istio addresses some key concerns that arise with microservices. You can make sure that only the services that are supposed to talk to each other are talking to each other. You can encrypt those communications to secure against attacks that can occur when those services interact, and you can apply application-wide policy. While there are other, manual, ways to accomplish much of this, the beauty of a mesh is that it brings several capabilities together and lets you apply them in a manner that is scalable.

At Aspen Mesh, we’re working on some new capabilities to help you get the most out of the security features in Istio. We’ll be posting something on that in the near future so check back in on the Aspen Mesh blog.


Observability, or "Knowing What Your Microservices Are Doing"

Microservicin’ ain’t easy, but it’s necessary. Breaking your monolith down into smaller pieces is a must in a cloud native world, but it doesn’t automatically make everything easier. Some things actually become more difficult. An obvious area where it adds complexity is communications between services; observability into service to service communications can be hard to achieve, but is critical to building an optimized and resilient architecture.

The idea of monitoring has been around for a while, but observability has become increasingly important in a cloud native landscape. Monitoring aims to give an idea of the overall health of a system, while observability aims to provide insights into the behavior of systems. Observability is about data exposure and easy access to information which is critical when you need a way to see when communications fail, do not occur as expected or occur when they shouldn’t. The way services interact with each other at runtime needs to be monitored, managed and controlled. This begins with observability and the ability to understand the behavior of your microservice architecture.

A primary microservices challenge is trying to understand how individual pieces of the overall system are interacting. A single transaction can flow through many independently deployed microservices or pods, and discovering where performance bottlenecks have occurred provides valuable information.

It depends who you ask, but many considering or implementing a service mesh say that the number one feature they are looking for is observability. There are many other features a mesh provides, but those are for another blog. Here, I’m going to cover the top observability features provided by a service mesh.

Tracing

One of the most important things to know about your microservices architecture is specifically which microservices are involved in an end-user transaction. If many teams are deploying their dozens of microservices, all independently of one another, it’s difficult to understand the dependencies across your services. Service mesh provides uniformity, which means tracing is programming-language agnostic, addressing inconsistencies in a polyglot world where different teams, each with its own microservice, can be using different programming languages and frameworks.

Distributed tracing is great for debugging and understanding your application’s behavior. The key to making sense of all the tracing data is being able to correlate spans from different microservices which are related to a single client request. To achieve this, all microservices in your application should propagate tracing headers. If you’re using a service mesh like Aspen Mesh, which is built on Istio, the ingress and sidecar proxies automatically add the appropriate tracing headers and report the spans to a tracing collector backend. Istio provides distributed tracing out of the box making it easy to integrate tracing into your system. Propagating tracing headers in an application can provide nice hierarchical traces that graph the relationship between your microservices. This makes it easy to understand what is happening when your services interact and if there are any problems.

Metrics

A service mesh can gather telemetry data from across the mesh and produce consistent metrics for every hop. Deploying your service traffic through the mesh means you automatically collect metrics that are fine-grained and provide high level application information since they are reported for every service proxy. Telemetry is automatically collected from any service pod providing network and L7 protocol metrics. Service mesh metrics provide a consistent view by generating uniform metrics throughout. You don’t have to worry about reconciling different types of metrics emitted by various runtime agents, or add arbitrary agents to gather metrics for legacy apps. It’s also no longer necessary to rely on the development process to properly instrument the application to generate metrics. The service mesh sees all the traffic, even into and out of legacy “black box” services, and generates metrics for all of it.

Valuable metrics that a service mesh gathers and standardizes include:

  • Success Rates
  • Request Volume
  • Request Duration
  • Request Size
  • Request and Error Counts
  • Latency
  • HTTP Error Codes

These metrics make it simpler to understand what is going on across your architecture and how to optimize performance.

Most failures in the microservices space occur during the interactions between services, so a view into those transactions helps teams better manage architectures to avoid failures. Observability provided by a service mesh makes it much easier to see what is happening when your services interact with each other, making it easier to build a more efficient, resilient and secure microservice architecture.