Why You Want Idempotency Anyway

Lately we've been talking about how you can use a service mesh to do progressive delivery.  Progressive delivery is fundamentally about decoupling software delivery from user activation of that software.  Once the two are decoupled, the user activation portion is under business control. It's early days, but the promise here is that software engineering can build new stuff as fast as they can, and your customer success team (already keeping a finger on the pulse of the users and market) can independently choose when to introduce which new features (and their associated risk) to whom.

We've demonstrated how you can use Flagger (a progressive delivery Kubernetes operator) to command a service mesh to do canary deploys.  These help you activate functionality for only a subset of traffic and then progressively increase that subset as long as the new functionality is healthy.  We're also working on an enhancement to Flagger to do traffic mirroring. I think this is really cool because it lets you try out a feature activation without actually exposing any users to the impact.  It's a "pre-stage" to a canary deployment: Send one copy to the original service, one copy to the canary, and check if the canary responds as well as the original.
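To make the mechanics concrete, here's a minimal Python sketch of what a mirroring step conceptually does (the PRIMARY_URL and CANARY_URL endpoints are made up for illustration): duplicate the request, send one copy to each backend, return only the primary's response to the caller, and compare the canary's answer on the side. In a real mesh the sidecar proxy does this for you; this is just the idea.

import requests  # third-party HTTP client, used here for brevity

PRIMARY_URL = "http://primary.example.com/checkout"   # hypothetical endpoints
CANARY_URL = "http://canary.example.com/checkout"

def mirror_request(payload: dict) -> requests.Response:
    """Send the same request to primary and canary; only the primary's
    response is returned to the caller, the canary's is just compared."""
    primary_resp = requests.post(PRIMARY_URL, json=payload, timeout=2)

    try:
        canary_resp = requests.post(CANARY_URL, json=payload, timeout=2)
        if canary_resp.status_code != primary_resp.status_code:
            print("canary diverged:", canary_resp.status_code,
                  "vs", primary_resp.status_code)
    except requests.RequestException as err:
        # A canary failure must never affect the user-facing request.
        print("canary error (ignored):", err)

    return primary_resp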

There's a caveat we bring up when we talk about this, however: idempotency.  You can only do traffic mirroring if you're OK with duplicating requests and sending one copy to the primary and one to the canary.  If your app and infrastructure are OK with these requests being duplicated, the requests are said to be idempotent.

Idempotency

Idempotency is the ability to apply an operation more than once and not change the result.  In math, we'd say that:

f(f(a)) = f(a)

An example of a mathematical function that's idempotent is ABS(), the absolute value.

ABS(-3.7) = 3.7
ABS(ABS(3.7)) = ABS(3.7) = 3.7

We can repeat taking the absolute value of something as many times as we want, it won't change any more after the first time.  Similar things in your math toolbox: CEIL(), FLOOR(), ROUND(). But SQRT() is not in general idempotent. SQRT(16) = 4, SQRT(SQRT(16)) = 2, and so on.
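The same property is easy to check in code; here's a quick Python illustration of the f(f(a)) = f(a) test for a few of these functions:

import math

for x in (-3.7, 3.7, 16.0):
    assert abs(abs(x)) == abs(x)                        # ABS is idempotent
    assert math.floor(math.floor(x)) == math.floor(x)   # so is FLOOR

# SQRT is not: applying it again keeps changing the result.
assert math.sqrt(16) == 4
assert math.sqrt(math.sqrt(16)) == 2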

For web apps, idempotent requests are those that can be processed twice without causing something invalid to happen.  So read-only operations like HTTP GETs are always idempotent (as long as they're actually only reads). Some kinds of writes are also idempotent: suppose I have a request that says "Set andrews_timezone to MDT".  If that request gets processed twice, my timezone gets set to MDT twice. That's OK. The first one might have changed it from PDT to MDT, and the second one "changes" it from MDT to MDT, so no change. But in the end, my timezone is MDT and so I'm good.

An example of a not-idempotent request is one that says "Deduct $100 from andrews_account".  If we apply that request twice, then my account will actually have $200 deducted from it and I don't want that.  You remember those e-commerce order pages that say "Don't click reload or you may be billed twice"? They need some idempotency!
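To make the distinction concrete, here's a small Python sketch using an in-memory dict as a stand-in for the datastore (set_timezone and deduct are just illustrative names):

profile = {"andrews_timezone": "PDT"}
account = {"andrews_account": 500}

def set_timezone(tz: str) -> None:
    """Idempotent: applying it twice leaves the same end state."""
    profile["andrews_timezone"] = tz

def deduct(amount: int) -> None:
    """Not idempotent: every duplicate changes the state again."""
    account["andrews_account"] -= amount

set_timezone("MDT"); set_timezone("MDT")
assert profile["andrews_timezone"] == "MDT"   # still fine after the duplicate

deduct(100); deduct(100)
assert account["andrews_account"] == 300      # $200 gone instead of $100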

This is important for traffic mirroring because we're going to duplicate the request and send one copy to the primary and one to the canary.  While idempotency is great for enabling this traffic mirroring case, I'm here to tell you why it's a great thing to have anyway, even if you're never going to do progressive delivery.

Exactly-Once Delivery and Invading Generals

There's a fundamental tension that emerges if you have distributed systems that communicate over an unreliable channel.  You can never be sure that the other side received your message exactly once.  I'll retell the parable of the Invading Generals as it was told to me the first time I was sure I had designed a system that solved this paradox.

There is a very large invading army that has camped in a valley for the night.  The defenders are split into two smaller armies that surround the invading army; one set of defenders on the eastern edge of the valley and one on the west.  If both defending armies attack simultaneously, they will defeat the invaders. However, if only one attacks, it will be too small; the invaders will defeat it, and then turn to the remaining defenders at their leisure and defeat them as well.

The general for the eastern defenders needs to coordinate with the general for the western defenders for simultaneous attack.  Both defenders can send spies with messages through the valley, as many as they want. Each spy has a 10% chance of being caught by the invaders and killed before his message is delivered, and a 90% chance of successfully delivering the message.  You are the general for the eastern defenders: what message will you send to the western defenders to guarantee you both attack simultaneously and defeat the invaders?

Turns out, there is no guaranteed safe approach.  Let's go through it. First, let's send this message:

"Western General, I will attack at dawn and you must do the same."

- Eastern General

There's a 90% chance that your spy will get through, but there's a 10% chance that he won't; in that case only you will attack, and you will lose to the invaders.  The problem statement said you can send as many spies as you want, so we must be able to do better!

OK, let's send lots of spies with the same message.  Then our probability of success is 1-0.1^n, where n is the number of spies.  So we can asymptotically approach 100% probability that the other side agrees, but we can never be sure.
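For example, with three spies the chance that at least one message arrives is 1 - 0.1^3 = 99.9%, high but never 100%. A quick sketch of that arithmetic in Python:

def p_at_least_one_delivered(n_spies: int, p_caught: float = 0.1) -> float:
    """Probability that at least one of n independent spies gets through."""
    return 1 - p_caught ** n_spies

for n in (1, 3, 10):
    print(n, p_at_least_one_delivered(n))   # 0.9, 0.999, 0.9999999999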

How about this message:

"Western General, I am prepared to attack at dawn.  Send me a spy confirming that you have received this message so I know you will also attack.  If I don't receive confirmation, I won't attack because my army will be defeated if I attack alone."

- Eastern General

Now, if you don't receive a spy back from the western general you'll send another, and another, until you get a response.  But.... put yourself in the shoes of the western general. How does the western general know that you'll receive the confirmation spy?  Should the western army attack at dawn? What if the confirmation spy was caught and now only the western army attacks, ensuring defeat?

The western general could send lots of confirmation spies, so there is a high probability that at least one gets through.  But they can't guarantee with 100% probability that one gets through.

The western general could also send this response:

"Eastern General, we have received your spy.  We are also prepared to attack at dawn. We will be defeated if you do not also attack, and I know you won't attack if you don't know that we have received your message.  Please send back a spy confirming that you have received my confirmation or else we will not attack because we will be destroyed."


- Western General

A confirmation of a confirmation! (In networking ARQ terms, an ACK-of-an-ACK.)  Again, this can reduce the probability of failure but cannot provide a guarantee: we can keep shifting uncertainty between the Eastern and Western generals but never eliminate it.

Engineering Approaches

Okay, we can't know for sure that our message is delivered exactly once (regardless of service mesh or progressive delivery or any of that), so what are we going to do?  There are a few approaches:

• Retry naturally-idempotent requests

• Uniquify requests

• Conditional updates

• Others

Retry Naturally-Idempotent Requests

If you have a request that is naturally idempotent, like getting the temperature on a thermostat, the end user can just repeat it if they didn't get the response they want.
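Because such a read is safe to repeat, the client can simply retry until it gets an answer. A minimal sketch, assuming a hypothetical read_temperature callable that may raise on network errors:

import time

def read_with_retry(read_temperature, attempts: int = 3,
                    backoff_seconds: float = 0.5) -> float:
    """Retry a naturally-idempotent read; duplicate reads are harmless."""
    for attempt in range(attempts):
        try:
            return read_temperature()
        except OSError:
            if attempt == attempts - 1:
                raise                                  # out of attempts
            time.sleep(backoff_seconds * (2 ** attempt))  # exponential backoff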

Uniquify Requests

Another approach is to make requests unique at the client, and then have all the other services avoid processing the same unique request twice.  One way to do this is to invent a UUID at the client and then have servers remember all the UUIDs they've already seen. My deduction request would then look like:

This is unique request f41182d1-f4b2-49ec-83cc-f5a8a06882aa.
If you haven't seen this request before, deduct $100 from andrews_account.

Then you can submit this request as many times as you want to the processor, and the processor can check if it's handled "f41182d1-f4b2-49ec-83cc-f5a8a06882aa" before.  There are a few caveats here.

First, you have to have a way to generate unique identifiers.  UUIDs are pretty good, but theoretically there's an extremely small possibility of UUID collision; practically, there are a couple of minor foot-guns to watch out for, like generating UUIDs on two VMs or containers that both have matching fake virtual MAC addresses.  You can also have the server make the unique identifier for you (it could be an auto-generated primary key in a database that is guaranteed to be unique).

Second, your server has to remember all the UUIDs it has processed.  Typically you put these in a database (maybe using the UUID as a primary key anyway).  If the record of processed UUIDs is separate from the action you take when processing, there's still a "risk window": you might commit a UUID and then fail to process it, or you might process it and fail to commit the UUID.  Algorithms like two-phase commit and Paxos can help close the risk window.
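Here's a small sketch of the server side of this idea, using an in-memory set in place of the durable table that would normally record processed request IDs (in a real system the ID record and the deduction should be committed in the same transaction to shrink that risk window; deduct_once is just an illustrative name):

import uuid

processed_ids: set[str] = set()          # stand-in for a durable table
balances = {"andrews_account": 500}

def deduct_once(request_id: str, account: str, amount: int) -> str:
    """Apply the deduction only if this request ID hasn't been seen before."""
    if request_id in processed_ids:
        return "already processed"        # duplicate; do nothing
    balances[account] -= amount
    processed_ids.add(request_id)         # ideally the same transaction as the deduction
    return "ok"

req_id = str(uuid.uuid4())                # client-generated unique ID
assert deduct_once(req_id, "andrews_account", 100) == "ok"
assert deduct_once(req_id, "andrews_account", 100) == "already processed"
assert balances["andrews_account"] == 400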

Conditional Updates

Another approach is to include information in the request about what things looked like when the client sent the request, so that the server can abort the request if something has changed.  This includes the case that the "change" is a duplicate request and we've already processed it.

For instance, suppose the last entry in my bank ledger for andrews_account is transaction number 563.  Then I would make my request look like:

As long as the last transaction in andrews_account is number 563,
Create entry 564: Deduct $100 from andrews_account

If this request gets duplicated, the first copy will succeed and the second will fail.  After the first copy is processed, the last transaction in andrews_account is number 564, so the duplicated request will fail:

As long as the last transaction in andrews_account is number 563,
Create entry 564: Deduct $100 from andrews_account

In this case the server could respond to the first copy with "Success" and to the second copy with a soft failure like "Already committed", or just tell the client to re-read and notice that its update already happened.  MongoDB, AWS DynamoDB and others support these kinds of conditional updates.
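Here's a sketch of the same compare-then-write idea in plain Python; a real system would push the precondition check down into the datastore (for example a conditional write in DynamoDB or a filtered update in MongoDB), and append_entry is just an illustrative name:

ledger = {"andrews_account": [          # (entry number, description)
    (562, "Deposit $300"),
    (563, "Deduct $50"),
]}

def append_entry(account: str, expected_last: int, description: str) -> str:
    """Append only if the ledger hasn't moved past what the client last saw."""
    entries = ledger[account]
    if entries[-1][0] != expected_last:
        return "conflict: already committed or concurrent update"
    entries.append((expected_last + 1, description))
    return "ok"

assert append_entry("andrews_account", 563, "Deduct $100") == "ok"
# The duplicated request carries the same precondition and now fails:
assert append_entry("andrews_account", 563, "Deduct $100").startswith("conflict")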

Others

There are many practical approaches to this problem.  I recommend doing some initial reasoning about idempotency, and then try to shift as much functionality as you can to the database or persistent state layer you're using.  While I gave a quick tour of some of the things involved in idempotency, there are a lot of other tricks like write-ahead journalling, conflict-free replicated data types and others that can enhance reliability.

Conclusion

Traffic mirroring is a great way to exercise canaries in a production environment before exposing them to your users.  Mirroring makes a duplicate of each request and sends one copy to the primary, one copy to the new canary version of your microservice.  This means that you must use mirroring only for idempotent requests: requests that can be applied twice without causing something erroneous to happen.

This caveat probably exists even if you aren't doing traffic mirroring, because networks fail.  The Eastern General and Western General can never really be sure their messages are delivered exactly once; there will always be a case where they may have to retry.  I think you want to build in idempotency wherever possible, and then you should use traffic mirroring to test your canary deployments.


Service Mesh Insider: An Interview with Shawn Wormke

Have you ever wondered how we got to service mesh? What backgrounds, experiences and technologies led to the emergence of service mesh? 

We recently put together an interview with Aspen Mesh’s Founder, Shawn Wormke in order to get the inside scoop for you. Read on to find out the answers to these three questions:

  1. What role did your technical expertise play in how Aspen Mesh focuses on enterprise service mesh?
  2. Describe how your technical and business experience combined to create an enterprise startup and inform your understanding of how to run a “modern” software company?
  3. What characteristics define a “modern” enterprise, and how does Aspen Mesh contribute to making it a reality?

1. What role did your technical expertise play in how Aspen Mesh focuses on the enterprise?

I started my career at Cisco working in network security and firewalls on the ASA product line and later the Firewall Services Module for the Catalyst 6500/7600 series switches and routers. Both of these products were focused on the enterprise at a time when security was starting to move up the stack and become more and more distributed throughout the network. We were watching our customers move from L2 transparent firewalls to L3/L4 firewalls that required application logic in order to “fixup” dynamic ports for protocols like FTP, SIP and H.323. Eventually that journey up the stack continued to L7 firewalls that were doing URL, header and payload inspection to enforce security policy.

At the same time that this move up the stack was happening, customers were starting to look at migrating workloads to VMs and were demanding new form factors and valuing different performance metrics. No longer were speeds, feeds and drag-strip numbers important; the focus was shifting to footprint and elasticity. The result of this shift in priority was a change in mindset when it came to how enterprises were thinking about expenses. They started to think about shifting expenses from large, capacity-stranding CAPEX purchases to more frequent OPEX transactions that were aligned with a software-first approach.

It was this idea that led me to join as one of the first engineers at a small startup in Boulder, CO called LineRate Systems which was eventually acquired by F5 Networks. The company was founded on a passion for making high performance, lightweight application delivery (aka load balancing) software that was as fast as the industry standard hardware. Our realization was that Commodity Off the Shelf (COTS) hardware had so much performance that if leveraged properly it was possible to offer the same performance at a much lower cost.

But the big idea, the one that ultimately got us noticed by F5, was that if the hardware was freely available (everyone had racks and racks of servers), we could charge our customers for a performance range and let them stamp out the software--as much as they needed--to achieve that. This removed the risk of the transaction from the customer as they no longer had to pre-plan 3-5 years worth of capacity.  It placed the burden on the provider to deliver an efficient and API-first elastic platform and a pricing model that scaled along the same dimensions as their business needs.

After acquisition we started to use containers and eventually Kubernetes for some of our build and test infrastructure. The use of these technologies led us to realize that they were great for increasing velocity and agility, but were difficult to debug and secure. We had no record of what our test containers did or who they talked to at runtime and we had no idea what data they were accessing. If we had a way to make sense of all of this, life would be so much easier.

This led us to work on some internal projects that experimented with ideas that we all now know as service mesh. We even released a product that was the beginning of this called the Application Services Proxy, which we ultimately end-of-lifed in 2017 when we made the decision to create Aspen Mesh.

In 2018 Aspen Mesh was born as an F5 Incubation. It is a culmination of almost 20 years of solving network and security problems for some of the world's largest enterprise customers and ensuring that the form-factor, consumption and pricing models are flexible and grow along with the businesses that use it. It is an acknowledgement that disruption is happening everywhere and that an organization's agility and ability to respond to disruption is its number one business asset. Companies are realizing this agility by redefining how they deliver value to their customers as quickly as possible using technologies like cloud, containers and Kubernetes.

We know that for enterprises, agility with stability is the number one competitive advantage. Through years of experience working on enterprise products we know that companies who can meet their evolving customer needs--while staying out of the news for downtime and security breaches--will be the winners of tomorrow. Aspen Mesh’s Enterprise Service Mesh enables enterprises to rapidly deliver value to their customers in a performant, secure and compliant way.

2. Describe how your technical and learned business experience combine to build an enterprise startup and inform your understanding of how best to run a “modern” software company?

Throughout my career I have been part of waterfall to agile transformations, worked on products that enabled business agility and now run a team that requires that same flexibility and business agility. We need to be focused on getting product to market that shows value to our customers as quickly as possible. We rely on automation to ensure that we are focusing our human capital on the most important tasks. We rely on data to make our decisions and ensure that the data we have is trustworthy and secure.

The great thing is that we get to be the ones doing the disrupting, and not the ones getting disrupted. What this means is we get to move fast and don’t have the burden of a large enterprise decision-making process. We can be agile and make mistakes, and we are actually expected to make mistakes. We are told "no" more than we are told "yes." But, learning from those failures and making course corrections along the way is key to our success.

Over the years I have come to embrace the power of open source and what it can do to accelerate projects and the impacts (both positive and negative) it can have on your company. I believe that in the future all applications will be born from open technologies. Companies that acknowledge and embrace this will be the most successful in the new cloud-native and open source world. How you choose to do that depends on your business and business model. You can be a consumer of OSS in your SaaS platform, an open-core product, glue multiple projects together to create a product or provide support and services; but if you are not including open source in your modern software company, you will not be successful.

Over the past 10 years we have seen and continue to see consumption models across all verticals rapidly evolve from perpetual NCR-based sales models with annual maintenance contracts, to subscription or consumption-based models, to fully managed SaaS-based offerings. I recently read an article on subscription-based banking. This is driven by the desire to shift the risk to the producer instead of the consumer. It is a realization by companies that capacity planning for 3-5 years is impossible, and that laying out that cash is a huge risk to the business they are no longer willing to take. It is up to technology producers to provide enough value to attract customers and then continue providing value to them to retain them year over year.

Considering how you are going to offer your product in a way that scales with your customers' value matrix and growth plans is critical. This applies to pricing as well as product functionality and performance.

Finally, I would be negligent if I didn’t mention data as a paramount consideration when running a modern software company. Insights derived from that data need to be at the center of everything you do. This goes not only for your product, but also your internal visibility and decision making processes. 

On the product side, when dealing with large enterprises it is critical to understand what your customers are willing to give you and how much value they need to realize in return. An enterprise's first answer will often be “no” when you tell them you need to access their data to run your product, but that just means you haven’t shown them enough value to say "yes." You need to consider what data you need, how much you need of it, where it will be stored and how you are protecting it.

On the internal side you need to measure everything. The biggest challenge I have found with an early-stage, small team is taking the time to enable these measurements. It is easy to drop off the list when you are trying to get new features out the door and you don’t yet know what you're going to do with the data. Resist that urge and force your teams to think about how they can do both, and if necessary take the time to slow down and put it in. Sometimes being thoughtful early on can help you go fast later, and adding hooks to gather and analyze data is one of those times.

Operating a successful modern software company requires you to embrace all the cliches about wearing multiple hats and failing fast. It's also critical to focus on being agile, embracing open source, creating a consumption-based offering, and relying on data, data, data and more data.

3. What characteristics define a “modern” enterprise, and how does Aspen Mesh contribute to making it a reality?

The modern enterprise craves agility and considers it to be its number one business advantage. This agility is what allows the enterprise to deliver value to customers as quickly as possible. This agility is often derived from a greater reliance on technology to enable rapid speed to market. Enterprises are constantly defending against disruption from non-traditional startup companies with seemingly unlimited venture funding and no expectation of profitability. All the while the enterprise is required to compete and deliver value while maintaining the revenue and profitability goals that its shareholders have grown to expect over years of sustained growth.

In order to remain competitive, enterprises are embracing new business models and looking for new ways to engage their customers through new digital channels. They are relying more on data and analytics to make business decisions and to make their teams and operations more efficient. Modern enterprises are embracing automation to perform mundane repetitive tasks and are turning over their workforce to gain the technical talent that allows them to compete with the smaller upstart disruptors in their space.

But agility without stability can be detrimental to an enterprise. As witnessed by many recent reports, enterprises can struggle with challenges around data and data security, perimeter breaches and downtime. It's easy to get caught up in the promise of the latest new technology, but moving fast and embracing new technology requires careful consideration of how it integrates into your organization, its security posture and how it scales with your business. Finding a trusted partner to accompany you on your transformation journey is key to long-term success.

Aspen Mesh is that technology partner when it comes to delivering next generation application architectures based on containers and Kubernetes. We understand the power and promise of agility and scalability that these technologies offer, but we also know that they introduce a new set of challenges for enterprises. These challenges include securing communication between services, observing and controlling service behavior and problems and managing the policy associated with services across large distributed organizations. 

Aspen Mesh provides a fully supported service mesh that is focused on enterprise use cases that include:

  • An advanced policy framework that allows users to describe business goals that are enforced in the application’s runtime environment
  • Role based policy management that enables organizations to create and apply policies according to their needs
  • A catalog of policies based on industry and security best practices that are created and tested by experts
  • Data analytics-based insights for enhanced troubleshooting and debugging
  • Predictive analytics to help teams detect and mitigate problems before they happen
  • Streamlined application deployment packages that provide a uniform approach to authentication and authorization, secure communications, and ingress and egress control
  • DevOps tools and workflow integration
  • A simplified user experience with improved organization and streamlined navigation to enable users to quickly find and mitigate failures and security issues
  • A consistent view of applications across multiple clouds to allow visibility from a global perspective to a hyper-local level
  • Graph visualizations of application relationships that enable teams to collaborate seamlessly on focused subsets of their infrastructure
  • Tabular representations surfacing details to find and remediate issues across multiple clusters running dozens or hundreds of services
  • A reduced-risk scalable consumption model that allows customers to pay as they grow

Thanks for reading! We hope that helps shed some light on what goes on behind the scenes at Aspen Mesh. And if you liked this post, feel free to subscribe to our blog in order to get updates when new articles are released.


Understanding Service Mesh

The Origin of Service Mesh

In the beginning, we had packets and packet-switched networks.

Everyone on the Internet — all 30 of them — used packets to build addressing and session establishment/teardown. Then they'd need a retransmission scheme. Then they'd build an ordered byte stream out of it.

Eventually, they realized they had all built the same thing. The RFCs for IP and TCP standardized this, operating systems provided a TCP/IP stack, so no application ever had to turn a best-effort packet network into a reliable byte stream.

We took our reliable byte streams and used them to make applications. Turns out that a lot of those applications had common patterns again — they requested things from servers, and then got responses. So, we separated these request/responses into metadata (headers) and body.

HTTP standardized the most widely deployed request/response protocol. Same story. App developers don't have to implement the mechanics of requests and responses. They can focus on the app on top.

There's a newer set of functionality that you need to build a reliable microservices application. Service discovery, versioning, zero-trust.... all the stuff popularized by the Netflix architecture, by 12-factor apps, etc. We see the same thing happening again - an emerging set of best practices that you have to build into each microservice to be successful.

So, service mesh is about putting all that functionality again into a layer, just like HTTP, TCP, packets, that's underneath your code, but creating a network for services rather than bytes.

Questions? Download The Complete Guide to Service Mesh or keep reading to find out more about what exactly a service mesh is.

What Is A Service Mesh?

A service mesh is a transparent infrastructure layer that sits between your network and application.

It’s designed to handle a high volume of service-to-service communications using application programming interfaces (APIs). A service mesh ensures that communication among containerized application services is fast, reliable and secure.

The mesh provides critical capabilities including service discovery, load balancing, encryption, observability, traceability, authentication and authorization, and the ability to control policy and configuration in your Kubernetes clusters.

Service mesh helps address many of the challenges that arise when your application is being consumed by the end user. Being able to monitor what services are communicating with each other, if those communications are secure, and being able to control the service-to-service communication in your clusters are key to ensuring applications are running securely and resiliently.

More Efficiently Managing Microservices

The self-contained, ephemeral nature of microservices comes with some serious upside, but keeping track of every single one is a challenge — especially when trying to figure out how the rest are affected when a single microservice goes down. The end result is that if you’re operating or developing in a microservices architecture, there’s a good chance part of your days are spent wondering what the hell your services are up to.

With the adoption of microservices, problems also emerge due to the sheer number of services that exist in large systems. Problems like security, load balancing, monitoring and rate limiting that had to be solved once for a monolith, now have to be handled separately for each service.

Service mesh helps address many of these challenges so engineering teams, and businesses, can deliver applications more quickly and securely.

Why You Might Care

If you’re reading this, you’re probably responsible for making sure that you and your end users get the most out of your applications and services. In order to do that, you need to have the right kind of access, security and support. That’s probably why you started down the microservices path.

If that’s true, then you’ve probably realized that microservices come with their own unique challenges, such as:

  1. Increased surface area that can be attacked
  2. Polyglot challenges
  3. Controlling access for distributed teams developing on a single application

That’s where a service mesh comes in.

A service mesh is an infrastructure layer for microservices applications that can help reduce the complexity of managing microservices and deployments by handling infrastructure service communication quickly, securely and reliably. 

Service meshes are great at solving operational challenges and issues when running containers and microservices because they provide a uniform way to secure, connect and monitor microservices. 

Here’s the point: a good service mesh keeps your company’s services running the way they should. A service mesh designed for the enterprise, like Aspen Mesh, gives you all the observability, security and traffic management you need — plus access to engineering and support, so you can focus on adding the most value to your business.

And that is good news for DevOps.

The Rise of DevOps - and How Service Mesh Is Enabling It

It’s happening, and it’s happening fast.

Companies are transforming internal orgs and product architectures along a new axis of performance. They’re finding more value in iterations, efficiency and incremental scaling, forcing them to adopt DevOps methodologies. This focus on time-to-market is driving some of the most cutting-edge infrastructure technology that we have ever seen. Technologies like containers and Kubernetes, and a focus on stable, consistent and open APIs allow small teams to make amazing progress and move at the speeds they require. These technologies have reduced the friction and time to market.

The adoption of these technologies isn’t perfect, and as companies deploy them at scale, they realize that they have inadvertently increased complexity and de-centralized ownership and control. In many cases, it’s challenging to understand the entire system.

A service mesh enables DevOps teams by helping manage this complexity. It provides autonomy and freedom for development teams through a stable and scalable platform, while simultaneously providing a way for platform teams to enforce security, policy and compliance standards.

This empowers your development teams to make choices based on the problems they are solving rather than being concerned with the underlying infrastructure. Dev teams now have the freedom to deploy code without the fear of violating compliance or regulatory guidelines, and platform teams can put guardrails in place to ensure your applications are secure and resilient.

Want to learn more? Get the Complete Guide to Service Mesh here.


Why Is Policy Hard?

Aspen Mesh spends a lot of time talking to users about policy, even if we don’t always start out calling it that. A common pattern we see with clients is:

  1. Concept: "Maybe I want this service mesh thing"
  2. Install: "Ok, I've got Aspen Mesh installed, now what?"
  3. Observe: "Ahhh! Now I see how my microservices are communicating.  Hmmm, what's that? That pod shouldn't be talking to that database!"
  4. Act: "Hey mesh, make sure that pod never talks to that database"

The Act phase is interesting, and there’s more to it than might be obvious at first glance. I'll propose that in this blog, we work through some thought experiments to delve into how service mesh can help you act on insights from the mesh.

First, put yourself in the shoes of the developer that just found out their test pod is accidentally talking to the staging database. (Ok, you're working from home today so you don't have to put on shoes; the cat likes sleeping on your shoeless feet better anyways.) You want to control the behavior of a narrow set of software for which you're the expert; you have local scope and focus.

Next, put on the shoes of a person responsible for operating many applications; people we talk to often have titles that include Platform, SRE, Ops, Infra. Each day they’re diving into different applications so being able to rapidly understand applications is key. A consistent way of mapping across applications, datacenters, clouds, etc. is critical. Your goal is to reduce "snowflake architecture" in favor of familiarity to make it easier when you do have to context switch.

Now let's change into the shoes of your org's Compliance Officer. You’re on the line for documenting and proving that your platform is continually meeting compliance standards. You don't want to be the head of the “Department of No”, but what’s most important to you is staying out of the headlines. A great day at work for you is when you've got clarity on what's going on across lots of apps, databases, external partners, every source of data your org touches AND you can make educated tradeoffs to help the business move fast with the right risk profile. You know it’s ridiculous to be involved in every app change, so you need separation-of-concerns.

I'd argue that all of these people have policy concerns. They want to be able to specify their goals at a suitably high level and leave the rote and repetitive portions to an automated system.  The challenging part is that there's only one underlying system ("the Kubernetes cluster") that has to respond to each of these disparate personas.

So, to me policy is about transforming a bunch of high-level behavioral prescriptions into much lower-level versions through progressive stages. Useful real-world policy systems do this in a way that is transparent and understandable to all users, and minimizes the time humans spend coordinating. Here's an example "day-in-the-life" of a policy:

At the top is the highest-level goal: "Devs should test new code without fear". Computers are hopeless at implementing this. At the bottom is a rule suitable for a computer like a firewall to implement, something like "allow traffic from 4.3.2.1".

The layers in the middle are where a bad policy framework can really hurt. Some personas (the hypothetical Devs) want to instantly jump to the bottom - they're the "4.3.2.1" in the above example. Other personas (the hypothetical Compliance Officer) are way up top, going down a few layers but not getting to the bottom on a day-to-day basis.

I think the best policy frameworks help each persona:

  • Quickly find the details for the layer they care about right now.
  • Understand where a given rule came from (connect to higher layers).
  • Understand whether it is doing what they want (trace to lower layers).
  • Know where to go to change it (edit/create policy).

As an example, let's look at iptables, one of the firewalling/packet mangling frameworks for Linux.  This is at that bottom layer in my example stack - very low-level packet processing that I might look at if I'm an app developer and my app's traffic isn't doing what I'd expect.  Here's an example dump:


root@kafka-0:/# iptables -n -L -v --line-numbers -t nat
Chain PREROUTING (policy ACCEPT 594K packets, 36M bytes)
num   pkts bytes target     prot opt in out   source destination
1     594K 36M ISTIO_INBOUND  tcp -- * * 0.0.0.0/0            0.0.0.0/0

Chain INPUT (policy ACCEPT 594K packets, 36M bytes)
num   pkts bytes target     prot opt in out   source destination

Chain OUTPUT (policy ACCEPT 125K packets, 7724K bytes)
num   pkts bytes target     prot opt in out   source destination
1      12M 715M ISTIO_OUTPUT  tcp -- * * 0.0.0.0/0            0.0.0.0/0

Chain POSTROUTING (policy ACCEPT 12M packets, 715M bytes)
num   pkts bytes target     prot opt in out   source destination

Chain ISTIO_INBOUND (1 references)
num   pkts bytes target     prot opt in out   source destination
1        0 0 RETURN     tcp -- * *     0.0.0.0/0 0.0.0.0/0            tcp dpt:22
2     594K 36M RETURN     tcp -- * *   0.0.0.0/0 0.0.0.0/0            tcp dpt:15020
3        2 120 ISTIO_IN_REDIRECT  tcp -- * * 0.0.0.0/0            0.0.0.0/0

Chain ISTIO_IN_REDIRECT (1 references)
num   pkts bytes target     prot opt in out   source destination
1        2 120 REDIRECT   tcp -- * *     0.0.0.0/0 0.0.0.0/0            redir ports 15006

Chain ISTIO_OUTPUT (1 references)
num   pkts bytes target     prot opt in out   source destination
1      12M 708M ISTIO_REDIRECT  all -- * lo 0.0.0.0/0           !127.0.0.1
2        7 420 RETURN     all -- * *     0.0.0.0/0 0.0.0.0/0            owner UID match 1337
3        0 0 RETURN     all -- * *     0.0.0.0/0 0.0.0.0/0            owner GID match 1337
4     119K 7122K RETURN     all -- * *   0.0.0.0/0 127.0.0.1
5        4 240 ISTIO_REDIRECT  all -- * * 0.0.0.0/0            0.0.0.0/0

Chain ISTIO_REDIRECT (2 references)
num   pkts bytes target     prot opt in out   source destination
1      12M 708M REDIRECT   tcp -- * *   0.0.0.0/0 0.0.0.0/0            redir ports 15001


This allows me to quickly understand a lot of details about what is happening at this layer. Each rule specification is on the right-hand side and is relatively intelligible to the personas that operate at this layer. On the left, I get "pkts" and "bytes" - this is a count of how many packets have triggered each rule, helping me answer "Is this doing what I want it to?". There's even more information here if I'm really struggling: I can log the individual packets that are triggering a rule, or mark them in a way that I can capture them with tcpdump.  

Finally, furthest on the left in the "num" column is a line number, which is necessary if I want to modify or delete rules or add new ones before/after; this is a little bit of help for "Where do I go to change this?". I say a little bit because in most systems that I'm familiar with, including the one I grabbed that dump from, iptables rules are produced by some program or higher-level system; they're not written by a human. So if I just added a rule, it would only apply until that higher-level system intervened and changed the rules (in my case, until a new Pod was created, which can happen at any time). I need help navigating up a few layers to find the right place to effect the change.

iptables lets you organize groups of rules into your own chains; in this case the name of the chain (ISTIO_***) is a hint that Istio produced these rules, so I've got a clue about which higher layer to examine.

For a much different example, how about the Kubernetes CI Robot (from Prow)? If you've ever made a PR to Kubernetes or many other CNCF projects, you likely interacted with this robot. It's an implementer of policy; in this case the policies around changing source code for Kubernetes.  One of the policies it manages is compliance with the Contributor License Agreement; contributors agree to grant some Intellectual Property rights surrounding their contributions. If k8s-ci-robot can't confirm that everything is alright, it will add a comment to your PR explaining what failed and where to go to fix it.

This is much different than firewall policy, but I say it's still policy and I think the same principles apply. Let's explore. If you had to diagram the policy around this, it would start at the top with the legal principle that Kubernetes wants to make sure all the software under its umbrella has free and clear IP terms. Stepping down a layer, the Kubernetes project decided to satisfy that requirement by requiring a CLA for any contributions. And so on until we get to the bottom layer, the code that implements the CLA check.

As an aside, the code that implements the CLA check is actually split into two halves: first there's a CI job that actually checks the commits in the PR against a database of signed CLAs, and then there's code that takes the result of that job and posts helpful information for users to resolve any issues. That's not visible or important at that top layer of abstraction (the CNCF lawyers shouldn't care).

This policy structure is easy to navigate. If your CLA check fails, the comment from the robot has great links. If you're an individual contributor you can likely skip up a layer, sign the CLA and move on. If you're contributing on behalf of a company, the links will take you to the document you need to send to your company's lawyers so they can sign on behalf of the company.

So those are two examples of policy. You probably encounter many other ones every day from corporate travel policy to policies (written, unwritten or communicated via email missives) around dirty dishes.

It's easy to focus on the technical capabilities of the lowest levels of your system. But I'd recommend that you don't lose focus on the operability of your system. It’s important that it be transparent and easy to understand. Both the iptables example and the k8s-ci-robot are transparent. The k8s-ci-robot has an additional feature: it knows you're probably wondering "Where did this come from?" and it answers that question for you. This helps you and your organization navigate the layers of policy. 

When implementing service mesh to add observability, resilience and security to your Kubernetes clusters, it’s important to consider how to set up policy in a way that can be navigated by your entire team. With that end in mind, Aspen Mesh is building a policy framework for Istio that makes it easy to implement policy and understand how it will affect application behavior.

Did you like this blog? Subscribe to get email updates when new Aspen Mesh blogs go live.


From NASA to Service Mesh

The New Stack recently published a podcast featuring our CTO, Andrew Jenkins discussing How Service Meshes Found a Former Space Dust Researcher. In the podcast, Andrew talks about how he moved from working on electrical engineering and communication protocols for NASA to software and finally service mesh development here at Aspen Mesh.

“My background is in electrical engineering, and I used to work a lot more on the hardware side of it, but I did get involved in communication, almost from the physical layer, and I worked on some NASA projects and things like that,” said Jenkins. “But then my career got further and further up into the software side of things, and I ended up at a company called F5 Networks. [Eventually] this ‘cloud thing’ came along, and F5 started seeing a lot of applications moving to the cloud. F5 offers their product in a version that you use in AWS, so what I was working on was an open source project to make a Kubernetes ingress controller for the F5 device. That was successful, but what we saw was that a lot of the traffic was shifting to the inside of the Kubernetes cluster. It was service-to-service communication from all these tiny things--these microservices--that were designed to be doing business logic. So this elevated the importance of communication...and that communication became very important for all of those tiny microservices to work together to deliver the final application experience for developers. So we started looking at that microservice communication inside and figuring out ways to make that more resilient, more secure and more observable so you can understand what’s going on between your applications.”

In addition, the podcast covers the evolution of service mesh, more details about tracing and logging, canaries, Kubernetes, YAML files and other surrounding technologies that extend service mesh to help simplify microservices management.

“I hope service meshes become the [default] way to deal with distributed tracing or certificate rotation. So, if you have an application, and you want it to be secure, you have to deal with all these certs, keys, etc.,” Jenkins said. “It’s not impossible, but when you have microservices, you have to do it a whole lot more times. So that’s why you get this better bang for the buck by pushing that down into that service mesh layer where you don’t have to repeat it all the time.”

To listen to the entire podcast, visit The New Stack’s post.

Interested in reading more articles like this? Subscribe to the Aspen Mesh blog:


Securing Containerized Applications With Service Mesh

The self-contained, ephemeral nature of microservices comes with some serious upside, but keeping track of every single one is a challenge, especially when trying to figure out how the rest are affected when a single microservice goes down. The end result is that if you’re operating or developing a microservices architecture, there’s a good chance part of your days are spent wondering what your services are up to.

With the adoption of microservices, problems also emerge due to the sheer number of services that exist in large systems. Problems like security, load balancing, monitoring and rate limiting that had to be solved once for a monolith, now have to be handled separately for each service.

The technology aimed at addressing these microservice challenges has been  rapidly evolving:

  1. Containers facilitate the shift from monolith to microservices by enabling independence between applications and infrastructure.
  2. Container orchestration tools solve microservices build and deploy issues, but leave many unsolved runtime challenges.
  3. Service mesh addresses runtime issues including service discovery, load balancing, routing and observability.

Securing services with a service mesh

A service mesh provides an advanced toolbox that lets users add security, stability and resiliency to containerized applications. One of the more common applications of a service mesh is bolstering cluster security. There are 3 distinct capabilities provided by the mesh that enable platform owners to create a more secure architecture.

Traffic Encryption  

As a platform operator, I need to provide encryption between services in the mesh. I want to leverage mTLS to encrypt traffic between services. I want the mesh to automatically encrypt and decrypt requests and responses, so I can remove that burden from my application developers. I also want it to improve performance by prioritizing the reuse of existing connections, reducing the need for the computationally expensive creation of new ones. I also want to be able to understand and enforce how services are communicating and prove it cryptographically.

Security at the Edge

As a platform operator, I want Aspen Mesh to add a layer of security at the perimeter of my clusters so I can monitor and address compromising traffic as it enters the mesh. I can use the built-in power of Kubernetes as an ingress controller to add security with ingress rules such as whitelisting and blacklisting. I can also apply service mesh route rules to manage compromising traffic at the edge. I also want control over egress so I can dictate that our network traffic does not go places it shouldn't (blacklist by default and only talk to what you whitelist).

Role Based Access Control (RBAC)

As the platform operator, it’s important that I am able to provide the level of least privilege so the developers on my platform only have access to what they need, and nothing more. I want to enable controls so app developers can write policy for their apps, and only their apps, so that they can move quickly without impacting other teams. I want to use the same RBAC framework that I am familiar with to provide fine-grained RBAC within my service mesh.

How a service mesh adds security

You’re probably thinking to yourself: traffic encryption and fine-grained RBAC sound great, but how does a service mesh actually get me to them? Service meshes that leverage a sidecar approach are uniquely positioned to intercept and encrypt data. A sidecar proxy is a prime insertion point for ensuring that every service in a cluster is secured and monitored in real time. Let’s explore some details around why sidecars are a great place for security.

Sidecar is a great place for security

Securing applications and infrastructure has always been daunting, in part because the adage really is true: you are only as secure as your weakest link.  Microservices are an opportunity to improve your security posture but can also cut the other way, presenting challenges around consistency.  For example, the best organizations use the principle of least privilege: an app should only have the minimum amount of permissions and privilege it needs to get its job done.  That's easier to apply where a small, single-purpose microservice has clear and narrowly-scoped API contracts.  But there's a risk that as application count increases (lots of smaller apps), this principle can be unevenly applied. Microservices, when managed properly, increase feature velocity and enable security teams to fulfill their charter without becoming the Department of No.

There's tension: Move fast, but don't let security coverage slip through the cracks.  Prefer many smaller things to one big monolith, but secure each and every one.  Let each team pick the language of their choice, but protect them with a consistent security policy.  Encourage app teams to debug, observe and maintain their own apps but encrypt all service-to-service communication.

A sidecar is a great way to balance these tensions with an architecturally sound security posture.  Sidecar-based service meshes like Istio and Linkerd 2.0 put their datapath functionality into a separate container and then situate that container as close to the application they are protecting as possible.  In Kubernetes, the sidecar container and the application container live in the same Kubernetes Pod, so the communication path between sidecar and app is protected inside the pod's network namespace; by default it isn't visible to the host or other network namespaces on the system.  The app, the sidecar and the operating system kernel are involved in communication over this path.  Compared to putting the security functionality in a library, using a sidecar adds the surface area of kernel loopback networking inside of a namespace, instead of just kernel memory management.  This is additional surface area, but not much.

The major drawbacks of library approaches are consistency and sprawl in polyglot environments.  If you have a few different languages or application frameworks and take the library approach, you have to secure each one.  This is not impossible, but it's a lot of work.  For each different language or framework, you get or choose a TLS implementation (perhaps choosing between OpenSSL and BoringSSL).  You need a configuration layer to load certificates and keys from somewhere and safely pass them down to the TLS implementation.  You need to reload these certs and rotate them.  You need to evaluate "information leakage" paths: does your config parser log errors in plaintext (so it by default might print the TLS key to the logs)?  Is it OK for app core dumps to contain these keys?  How often does your organization require re-keying on a connection?  By bytes or time or both?  Minimum cipher strength?  When a CVE in OpenSSL comes out, what apps are using that version and need updating?  Who on each app team is responsible for updating OpenSSL, and how quickly can they do it?  How many apps have a certificate chain built into them for consuming public websites even if they are internal-only?  How many Dockerfiles will you need to update the next time a public signing authority has to revoke one?  slowloris?

Your organization can do all this work.  In fact, parts probably already have - above is our list of painful app security experiences but you probably have your own additions.  It is a lot of cross-organizational effort and process to get it right.  And you have to get it right everywhere, or your weakest link will be exploited.  Now with microservices, you have even more places to get it right.  Instead, our advice is to focus on getting it right once in the sidecar, and then distributing the sidecar everywhere, and get back to adding business value instead of duplicating effort.

There are some interesting developments on the horizon like the use of kernel TLS to defer bulk and some asymmetric crypto operations to the kernel.  That's great:  Implementations should change and evolve.  The first step is providing a good abstraction so that apps can delegate to lower layers. Once that's solid, it's straightforward to move functionality from one layer to the next as needed by use case, because you don't perturb the app any more.  As precedent, consider TCP Segmentation Offload, which lets the network card manage splitting app data into the correct size for each individual packet.  This task isn't impossible for an app to do, but it turns out to be wasted effort.  By deferring TCP segmentation to the kernel, it left the realm of the app.  Then, kernels, network drivers, and network cards were free to focus on the interoperability and semantics required to perform TCP segmentation at the right place.  That's our position for this higher-level service-to-service communication security: move it outside of the app to the sidecar, and then let sidecars, platforms, kernels and networking hardware iterate.

Envoy Is a Great Sidecar

We use Envoy as our sidecar because it's lightweight, has some great features and good API-based configurability.  Here are some of our favorite parts about Envoy:

  • Configurable TLS Parameters: Envoy exposes all the TLS configuration points you'd expect (cipher strength, protocol versions, curves).  The advantage to using Envoy is that they're configured the same way for every app using the sidecar.
  • Mutual TLS: Typically TLS is used to authenticate the server to the client, and to encrypt communication.  What's missing is authenticating the client to the server - if you do this, then the server knows what is talking to it.  Envoy supports this bi-directional authentication out of the box, which can easily be incorporated into a SPIFFE system.  In today's complex cloud datacenter, you're better off if you trust things based on cryptographic proof of what they are, instead of network perimeter protection of where they called from.
  • BoringSSL: This fork of OpenSSL removed huge amounts of code like implementations of obsolete ciphers and cleaned up lots of vestigial implementation details that had repeatedly been the source of security vulnerabilities.  It's a good default choice if you don't need any OpenSSL-specific functionality because it's easier to get right.
  • Security Audit: A security audit can't prove the absence of vulnerabilities but it can catch mistakes that demonstrate either architectural weaknesses or implementation sloppiness.  Envoy's security audit did find issues but in our opinion indicated a high level of security health.
  • Fuzzed and Bountied: Envoy is continuously fuzzed (exposed to malformed input to see if it crashes) and covered by Google's Patch Reward security bug bounty program.
  • Good API Granularity: API-based configuration doesn't mean "just serialize/deserialize your internal state and go."  Careful APIs thoughtfully map to the "personas" of what's operating them (even if those personas are other programs).  Envoy's xDS APIs in our experience partition routing behavior from cluster membership from secrets.  This makes it easy to make well-partitioned controllers.  A knock-on benefit is that it is easy in our experience to debug and test Envoy because config constructs usually map pretty clearly to code constructs.
  • No garbage collector: There are great languages with automatic memory management like Go that we use every day.  But we find languages like C++ and Rust provide predictable and optimizable tail latency.
  • Native Extensibility via Filters: Envoy has layer 4 and layer 7 extension points via filters that are written in C++ and linked into Envoy.
  • Scripting Extensibility via Lua: You can write Lua scripts as extension points as well.  This is very convenient for rapid prototyping and debugging.

One of these benefits deserves an even deeper dive in a security-oriented discussion.  The API granularity of Envoy is based on a scheme called "xDS" which we think of as follows:  Logically split the Envoy config API based on the user of that API.  The user in this case is almost always some other program (not a human), for instance a Service Mesh control plane element.

For instance, in xDS, listeners ("How should I get requests from users?") are separated from clusters ("What pods or servers are available to handle requests to the shoppingcart service?").  The "x" in "xDS" is replaced with whatever functionality is implemented ("LDS" for listener discovery service).  Our favorite security-related partitioning is that the Secret Discovery Service (SDS) can be used for propagating secrets to the sidecars independent of the other xDS APIs.

Because SDS is separate, the control plane can implement the Principle of Least Privilege: nothing outside of SDS needs to handle or have access to any private key material.
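
To make the partitioning concrete, here is a minimal, hedged sketch of an Envoy bootstrap fragment in that spirit.  The cluster names are hypothetical and the field names follow the v3 flavor of the Envoy API (older deployments use the v2 equivalents); the point is simply that routing and membership flow through one channel while secrets flow through another.

# Hypothetical Envoy bootstrap fragment (v3 API): listener and cluster config
# come from the control plane over xDS, which never needs to see key material.
dynamic_resources:
  lds_config:
    resource_api_version: V3
    api_config_source:
      api_type: GRPC
      transport_api_version: V3
      grpc_services:
        - envoy_grpc: { cluster_name: xds_cluster }   # hypothetical control-plane cluster
  cds_config:
    resource_api_version: V3
    api_config_source:
      api_type: GRPC
      transport_api_version: V3
      grpc_services:
        - envoy_grpc: { cluster_name: xds_cluster }

# Inside a listener's common_tls_context, certificates are referenced by name
# and fetched over the dedicated SDS channel, so only this path touches secrets:
#
#   tls_certificate_sds_secret_configs:
#     - name: default
#       sds_config:
#         api_config_source:
#           api_type: GRPC
#           transport_api_version: V3
#           grpc_services:
#             - envoy_grpc: { cluster_name: sds_cluster }   # hypothetical SDS endpoint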

Mutual TLS is a great enhancement to your security posture in a microservices environment.  We see mutual TLS adoption as gradual - almost any real-world app will have some containerized microservices ready to join the service mesh and use mTLS on day one.  But practically speaking, many of these will depend on mesh-external services, containerized or not.  It is possible in most cases to integrate these services into the same trust domain as the service mesh, and oftentimes these components can even participate in client TLS authentication so you get true mutual TLS.

In our experience, this happens by gradually expanding the "circle" of things protected with mutual TLS: first, stateless containerized business logic; next, in-cluster third-party services; finally, external state stores like bare-metal databases.  That's why we focus on making the state of mTLS easy to understand in Aspen Mesh, and provide assistants to help you detect configuration mishaps.
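
As a hedged illustration of what one step of that expanding circle can look like in Istio, the sketch below enables permissive mTLS for a single, hypothetical namespace, so workloads there accept both plaintext and mTLS while their clients migrate.  Newer Istio releases express this as a PeerAuthentication resource; older ones use an authentication Policy, but the idea is the same.

# Hypothetical namespace-scoped policy: accept both plaintext and mTLS
# while clients are gradually moved into the mesh (Istio 1.5+ API).
apiVersion: security.istio.io/v1beta1
kind: PeerAuthentication
metadata:
  name: default
  namespace: checkout        # hypothetical namespace
spec:
  mtls:
    mode: PERMISSIVE         # switch to STRICT once every client speaks mTLS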

What lives outside the sidecar?

You need a control plane to configure all of these sidecars.  In some simple cases it may be tempting to do this with CI integration to generate configs plus DNS-based discovery.  This is viable, but it makes rapid certificate rotation hard.  It also leaves out more dynamic techniques like canaries, progressive delivery and A/B testing.  For these reasons, we think most real-world applications will include an online control plane that should:

  • Disseminate configuration to each of the sidecars with a scalable approach.
  • Rotate sidecar certificates rapidly to reduce the value to an attacker of a one-time exploit of an application.
  • Collect metadata on what is communicating with what.

A good security posture means you should be automating some work on top of the control plane. We think these things are important (and built them into Aspen Mesh):

  • Organizing information to help humans narrow in on problems quickly.
  • Warning on potential misconfigurations.
  • Alerting when unhealthy communication is observed (see the sketch after this list).
  • Inspecting the firehose of metadata for surprises - these patterns could be application bugs, security issues or both.
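
As one example of the alerting item above, here is a hedged Prometheus alerting-rule sketch over Istio's standard istio_requests_total metric.  The rule name, threshold and window are illustrative only and would need tuning for a real deployment.

# Illustrative Prometheus alerting rule: fire when any destination service's
# 5xx rate exceeds 5% of its traffic for 10 minutes (numbers are examples).
groups:
  - name: mesh-traffic-health
    rules:
      - alert: MeshHighServerErrorRate
        expr: |
          sum(rate(istio_requests_total{response_code=~"5.."}[5m])) by (destination_service)
            /
          sum(rate(istio_requests_total[5m])) by (destination_service)
            > 0.05
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "High 5xx rate for {{ $labels.destination_service }}"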

If you’re considering or going down the Kubernetes path, you should be thinking about the unique security challenges that come with microservices running in a Kubernetes cluster. Kubernetes solves many of these, but there are some critical runtime issues that a service mesh can make easier and more secure. If you would like to talk about how the Aspen Mesh platform and team can address your specific security challenge, feel free to find some time to chat with us.


The Complete Guide to Service Mesh

What’s Going On In The Service Mesh Universe?

Service meshes are relatively new, extremely powerful and can be complex. There’s a lot of information out there on what a service mesh is and what it can do, but it’s a lot to sort through. Sometimes, it’s helpful to have a guide. If you’ve been asking questions like “What is a service mesh?” “Why would I use one?” “What benefits can it provide?” or “How did people even come up with the idea for service mesh?” then The Complete Guide to Service Mesh is for you.

Check out the free guide to find out:

  • The service mesh origin story
  • What a service mesh is
  • Why developers and operators love service mesh
  • How a service mesh enables DevOps
  • Problems a service mesh solves

The Landscape Right Now

A service mesh overlaps, complements, and in some cases, replaces many tools that are commonly used to manage microservices. Last year was all about evaluating and trying out service meshes. But while curiosity about service mesh is still at its peak, enterprises are already well into the evaluation and adoption process.

The capabilities a service mesh adds to ease managing microservices applications at runtime are clearly exciting to early adopters and to companies evaluating service mesh. Conversations tell us that many enterprises are already using microservices and service mesh, and many others are planning to deploy in the next six months. And if you’re not yet sure whether you need a service mesh, check out the recent Gartner, 451 and IDC reports on microservices, all of which say a service mesh will be mandatory by 2020 for any organization running microservices in production.

Get Started with Service Mesh

Are you already using Kubernetes and Istio? You might be ready to get started using a service mesh. Download Aspen Mesh here or contact us to talk with a service mesh expert about getting set up for success.



To Multicluster, or Not to Multicluster: Solving Kubernetes Multicluster Challenges with a Service Mesh

If you are going to be running multiple clusters for dev and organizational reasons, it’s important to understand your requirements and decide whether you want to connect these in a multicluster environment and, if so, to understand various approaches and associated tradeoffs with each option.

Kubernetes has become the container orchestration standard, and many organizations are currently running multiple clusters. But while communication issues within clusters are largely solved, communication across clusters is still a major challenge for most organizations.

Service mesh helps to address multicluster challenges. Start by identifying what you want, then shift to how to get it. We recommend understanding your specific communication use case, identifying your goals, then creating an implementation plan.

Multicluster offers a number of benefits:

  • Single pane of glass
  • Unified trust domain
  • Independent fault domains
  • Intercluster traffic
  • Heterogeneous/non-flat network

These benefits can be achieved through various approaches:

  • Independent clusters
  • Common management
  • Cluster-aware service routing through gateways
  • Flat network
  • Split-horizon Endpoints Discovery Service (EDS)

If you have decided to multicluster, your next move is deciding the best implementation method and approach for your organization. A service mesh like Istio can help, and when used properly can make multicluster communication painless.

Read the full article here on InfoQ’s site.



Expanding Service Mesh Without Envoy

Istio uses the Envoy sidecar proxy to handle traffic within the service mesh.  The following article describes how to use an external proxy, F5 BIG-IP, to integrate with an Istio service mesh without having to use Envoy for the external proxy.  This can provide a method to extend the service mesh to services where it is not possible to deploy an Envoy proxy.

This method could be used to secure a legacy database to only allow authorized connections from a legacy app that is running in Istio, but not allow any other applications to connect.

Securing Legacy Protocols

A common problem that customers face when deploying a service mesh is how to restrict access to an external service to a limited set of services in the mesh.  When any service can run on any node, it is not possible to restrict access by IP address (“good container” comes from the same IP as “malicious container”).

One method of securing the connection is to isolate an egress gateway on dedicated nodes and only allow traffic to the database from those nodes.  This is described in Istio’s documentation:

Istio cannot securely enforce that all egress traffic actually flows through the egress gateways. Istio only enables such flow through its sidecar proxies. If attackers bypass the sidecar proxy, they could directly access external services without traversing the egress gateway. Thus, the attackers escape Istio’s control and monitoring. The cluster administrator or the cloud provider must ensure that no traffic leaves the mesh bypassing the egress gateway.

   -- https://istio.io/docs/examples/advanced-gateways/egress-gateway/#additional-security-considerations (2019-03-25)
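
If you take this egress-gateway approach, the external database also has to be made addressable inside the mesh so traffic can be steered through the gateway.  A hedged sketch of that piece is below; the hostname and port are hypothetical, and the Gateway/VirtualService routing plus the node isolation described above are configured separately.

# Hypothetical ServiceEntry making an external database addressable from the mesh.
apiVersion: networking.istio.io/v1alpha3
kind: ServiceEntry
metadata:
  name: legacy-db
spec:
  hosts:
    - db.legacy.example.com    # hypothetical external hostname
  location: MESH_EXTERNAL
  resolution: DNS
  ports:
    - number: 5432             # hypothetical database port
      name: tcp-db
      protocol: TCP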

Another method would be to use mesh expansion to install Envoy onto the VM that is hosting your database. In this scenario the Envoy proxy on the database server would validate requests prior to forwarding them to the database.

The third method, which we cover here, is to deploy a BIG-IP to act as an egress device that is external to the service mesh.  This is a hybrid of mesh expansion and multicluster mesh.

Mesh Expansion Without Envoy

Under the covers, Envoy uses mutual TLS to secure communication between proxies.  To participate in the mesh, a proxy must use certificates that are trusted by Istio; this is how VM mesh expansion and multicluster service mesh are configured with Envoy.  To use an alternate proxy, we need the ability to use certificates that are trusted by Istio.

Example of Extending Without Envoy

A proof of concept of extending the mesh can be built with the following example.  We will create a TCP-based “echo” service that lives outside of the service mesh.  The goal is to restrict access so that only authorized “good containers” can connect to the “echo” service via the BIG-IP.  The steps involved:

  1. Retrieve/Create certificates trusted by Istio
  2. Configure external proxy (BIG-IP) to use trusted certificates and only trust Istio certificates
  3. Add policy to external proxy to only allow “good containers” to connect
  4. Register BIG-IP device as a member of the Istio service mesh
  5. Verify that “good container” can connect to “echo” and “bad container” cannot

First, we install a set of certificates on the BIG-IP that Envoy will trust and configure the BIG-IP to only allow connections from Istio.  The certs could either be pulled directly from Kubernetes (similar to setting up mesh expansion) or generated by a common CA that is trusted by Istio (similar to multicluster service mesh).
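
For orientation, in Citadel-era Istio those per-service-account certificates live in Kubernetes secrets shaped roughly like the hedged sketch below; names, keys and mechanisms differ in newer Istio versions that deliver certificates over SDS instead.

# Rough shape of a Citadel-issued secret (values are base64-encoded and elided);
# these are the materials you would pull to configure an external proxy.
apiVersion: v1
kind: Secret
metadata:
  name: istio.default            # "istio." + service account name
  namespace: default
type: istio.io/key-and-cert
data:
  root-cert.pem: <base64 root CA certificate>
  cert-chain.pem: <base64 workload certificate chain>
  key.pem: <base64 private key>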

Once the certs are retrieved or generated, we install them onto the proxy (the BIG-IP) and configure the device to only trust client-side certificates that are generated by Istio.

To enforce a policy that validates the identity of the “good container”, we inspect the X.509 Subject Alternative Name fields of the client certificate and extract the SPIFFE name that contains the identity of the container.

Once the external proxy is configured we can register the device using “istioctl register” (similar to mesh expansion).

To verify that our test scenario is working, we will have two namespaces, “default” and “trusted”.  Connections from “trusted” will be allowed and connections from “default” will be rejected.  From each namespace we create a pod and run the command “nc bigip.default.svc.cluster.local 9000”.  Looking at our BIG-IP logs, we can verify that our policy (iRule) worked:

Mar 25 18:56:39 ip-10-1-1-7 info tmm5[17954]: Rule /Common/log_cert <CLIENTSSL_CLIENTCERT>: allowing: spiffe://cluster.local/ns/trusted/sa/sleep
Mar 25 18:57:00 ip-10-1-1-7 info tmm2[17954]: Rule /Common/log_cert <CLIENTSSL_CLIENTCERT>: rejecting spiffe://cluster.local/ns/default/sa/default

Connection from our “good container”

/ # nc bigip.default.svc.cluster.local 9000
hi
HI

Connection from our “bad container”

# nc bigip.default.svc.cluster.local 9000

In the case of the “bad container” we are unable to connect.  The “nc” (netcat) command simulates a very basic TCP client.  A more realistic example would be connecting to an external database that contains sensitive data.  In the “good” example we are echoing back the capitalized input (“hi” becomes “HI”).

Just One Example

In this article we looked at expanding a service mesh without Envoy.  This was focused on egress TCP traffic, but it could be expanded to:

  • Using BIG-IP as an SNI proxy instead of NGINX
  • Securing inbound traffic using mTLS and/or JWT tokens
  • Using BIG-IP as an ingress gateway
  • Using ServiceEntry/DestinationRules instead of a registered service (sketched below)
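
On that last point, here is a hedged sketch of what replacing “istioctl register” with declarative resources might look like: a ServiceEntry that makes the BIG-IP addressable from the mesh and a DestinationRule that tells sidecars to originate Istio mutual TLS toward it.  The hostname, port and resource names are hypothetical.

# Hypothetical declarative alternative to "istioctl register": make the BIG-IP
# addressable and have sidecars originate Istio mTLS when calling it.
apiVersion: networking.istio.io/v1alpha3
kind: ServiceEntry
metadata:
  name: bigip-egress
spec:
  hosts:
    - bigip.example.internal     # hypothetical hostname for the BIG-IP
  location: MESH_EXTERNAL
  resolution: DNS
  ports:
    - number: 9000
      name: tcp-echo
      protocol: TCP
---
apiVersion: networking.istio.io/v1alpha3
kind: DestinationRule
metadata:
  name: bigip-egress
spec:
  host: bigip.example.internal
  trafficPolicy:
    tls:
      mode: ISTIO_MUTUAL         # present the sidecar's Istio-issued client cert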

If you want to see the process in action, check out this short video walkthrough.

https://youtu.be/83GdmwTvWLI

Let me know in the comments whether you’re interested in any of these use cases or have come up with your own.  Thank you!


Why Service Meshes, Orchestrators Are Do or Die for Cloud Native Deployments

The self-contained, ephemeral nature of microservices comes with some serious upside, but keeping track of every single one is a challenge, especially when trying to figure out how the rest are affected when a single microservice goes down. The end result is that if you’re operating or developing in a microservices architecture, there’s a good chance that part of your day is spent wondering what the hell your services are up to.

With the adoption of microservices, problems also emerge due to the sheer number of services that exist in large systems. Problems like security, load balancing, monitoring and rate limiting that had to be solved once for a monolith, now have to be handled separately for each service.

The good news is that engineers love a good challenge. And almost as quickly as they are creating new problems with microservices, they are addressing those problems with emerging microservices tools and technology patterns. Maybe the emergence of microservices is just a smart play by engineers to ensure job security.

Today’s cloud native darling, Kubernetes, eases many of the challenges that come with microservices. Auto-scheduling, horizontal scaling and service discovery solve the majority of build-and-deploy problems you’ll encounter with microservices.

What Kubernetes leaves unsolved are a few key runtime issues for containerized applications. That’s where a service mesh steps in. Let’s take a look at what Kubernetes provides, and how Istio adds to Kubernetes to solve the microservices runtime issues.

Kubernetes Solves Build-and-Deploy Challenges

Managing microservices at runtime is a major challenge. A service mesh helps alleviate this challenge by providing observability, control and security for your containerized applications. Aspen Mesh is the fully supported distribution of Istio that makes service mesh simple and enterprise-ready.

Kubernetes supports a microservice architecture by enabling developers to abstract away the functionality of a set of pods, and expose services to other developers through a well-defined API. Kubernetes enables L4 load balancing, but it doesn’t help with higher-level problems, such as L7 metrics, traffic splitting, rate limiting and circuit breaking.

Service Mesh Addresses Challenges of Managing Traffic at Runtime

Service mesh helps address many of the challenges that arise when your application is being consumed by the end user. Being able to monitor which services are communicating with each other, whether those communications are secure, and being able to control the service-to-service communication in your clusters is key to ensuring applications run securely and resiliently.
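
As a concrete taste of that control, here is a hedged sketch of L7 traffic splitting with an Istio VirtualService; the service name, subsets and weights are hypothetical, and the v1/v2 subsets would be defined in a companion DestinationRule.

# Hypothetical weighted routing: send 90% of traffic to v1 and 10% to v2.
# The v1/v2 subsets are defined in a separate DestinationRule (not shown).
apiVersion: networking.istio.io/v1alpha3
kind: VirtualService
metadata:
  name: shoppingcart
spec:
  hosts:
    - shoppingcart
  http:
    - route:
        - destination:
            host: shoppingcart
            subset: v1
          weight: 90
        - destination:
            host: shoppingcart
            subset: v2
          weight: 10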

Istio also provides a consistent view across a microservices architecture by generating uniform metrics throughout. It removes the need to reconcile different types of metrics emitted by various runtime agents, or add arbitrary agents to gather metrics for legacy un-instrumented apps. It adds a level of observability across your polyglot services and clusters that is unachievable at such a fine-grained level with any other tool.

Istio also adds a much deeper level of security. While Kubernetes only provides basic secret distribution and control-plane certificate management, Istio provides mTLS capabilities so you can encrypt on-the-wire traffic and ensure your service-to-service communications are secure.

A Match Made in Heaven

Pairing Kubernetes with a service mesh like Istio gives you the best of both worlds, and since Istio was made to run on Kubernetes, the two work together seamlessly. You can use Kubernetes to manage all of your build and deploy needs, and Istio takes care of the important runtime issues.

Kubernetes has matured to a point that most enterprises are using it for container orchestration. Currently, there are 74 CNCF-certified service providers, which is a testament to a large and growing market. I see Istio as an extension of Kubernetes and a next step toward solving more challenges in what feels like a single package.

Already, Istio is quickly maturing and is starting to see more adoption in the enterprise. It’s likely that in 2019 we will see Istio emerge as the service mesh standard for enterprises in much the same way Kubernetes has emerged as the standard for container orchestration.