Recent security vulnerabilities require Zero-Trust Security tactics for your microservices environment

Despite significant technological advancements, security is still hard. A single phishing email, missed patch, or misconfiguration can let the bad guys in to wreak havoc or steal data. For companies moving to the cloud and the cloud-native architecture of microservices and containerized applications, it’s even harder. Now, in addition to the perimeter and the network itself, the myriad connections between microservice containers must also be protected. 

With microservices, the surface area of your network vulnerable to attack increases exponentially, putting data at greater risk. Moreover, network-related problems like access control, load balancing, and monitoring that had to be solved only once for a monolithic application now must be solved separately for each service within a cluster, as well as between clusters. 

Zero Trust Security Methodology and Networking Principles

Zero-Trust dates to the 1990s as a method for “perimeter-less” security. The main concept behind the methodology is “never trust, always verify,” even if the network was previously verified.

  • Networks should always be considered hostile 
  • Network locality is not sufficient for deciding trust in a network 
  • Every device, user, and request should be authenticated and authorized 
  • Network policies must be dynamic and calculated from as many sources of data as possible
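As a minimal sketch of the principles above, the following Python shows a default-deny check in which trust comes from a verified identity and an explicit policy, never from where the request originated. The policy table, service names, and the Request type are hypothetical and exist only for illustration.

```python
from dataclasses import dataclass
from typing import Optional

# Hypothetical policy table: (caller identity, target service) -> allowed methods.
ALLOWED = {
    ("frontend", "orders"): {"GET", "POST"},
    ("orders", "payments"): {"POST"},
}


@dataclass(frozen=True)
class Request:
    caller_identity: Optional[str]  # identity proven by a verified credential, or None
    source_ip: str                  # kept for logging only; never used to grant trust
    target: str
    method: str


def authorize(req: Request) -> bool:
    """Allow a request only if the caller is authenticated and explicitly permitted."""
    if req.caller_identity is None:
        # Unauthenticated callers are denied even if they are "inside" the network.
        return False
    allowed_methods = ALLOWED.get((req.caller_identity, req.target), set())
    return req.method in allowed_methods  # anything not listed is denied
```

Note that source_ip never appears in the decision: network locality alone is not treated as evidence of trust.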

Today, it’s essential to apply a Zero-Trust approach to network security and to service mesh technology. Our newly completed white paper, Zero-Trust Security for your Microservices Architecture, outlines what it takes to implement the key tenets of Zero-Trust security using a service mesh to secure a microservices environment, and provides steps to mitigate cyberattacks and protect containerized applications.

What's covered in our white paper, Zero-Trust Security for your Microservices Architecture:
  1. Zero-Trust authentication methodology for a service mesh
  2. mTLS encryption: Achieve non-repudiation for requests without requiring any changes or support from the applications. Identity, certificates, and authorization ensure that “every device, user, and request is authenticated and authorized,” a Zero-Trust principle
  3. The built-in methods Istio uses to combat security vulnerabilities
  4. Ingress and Egress security control within a service mesh
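To illustrate what the “mutual” in mTLS means, here is a minimal sketch using Python’s standard ssl module. In a service mesh this work is typically done by the sidecar proxy rather than by application code, so this is an illustration of the concept and not how Istio or Aspen Mesh configure it. The function name and the file-path parameters are assumptions made for the example.

```python
import ssl


def build_mtls_server_context(cert_file=None, key_file=None, ca_file=None):
    """Return a TLS server context that rejects clients without a valid certificate."""
    ctx = ssl.SSLContext(ssl.PROTOCOL_TLS_SERVER)
    ctx.minimum_version = ssl.TLSVersion.TLSv1_2
    # The "mutual" part: the server also demands and verifies a client certificate.
    ctx.verify_mode = ssl.CERT_REQUIRED
    if cert_file and key_file:
        # The server's own identity, presented to connecting clients.
        ctx.load_cert_chain(certfile=cert_file, keyfile=key_file)
    if ca_file:
        # The certificate authority trusted to have issued client certificates.
        ctx.load_verify_locations(cafile=ca_file)
    return ctx
```

With ordinary TLS only the server proves its identity; setting verify_mode to CERT_REQUIRED on the server side is what makes both ends authenticate.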

Lastly, in the paper we touch on Aspen Mesh’s approach to Zero-Trust security, including how we configure mTLS, secure ingress, monitor egress, prevent RBAC (Role Based Access Control) misconfigurations and apply policy and configuration best practices.

Aspen Mesh has deep expertise in Istio and understands how to get the most out of it - our Services and 24/7 Service Mesh Support are unmatched in the industry.

- Andy

Adopting a Zero-Trust Approach to Security for Containerized Applications

Adopting a zero-trust secure service mesh can help remove the burden of addressing security requirements from your application development teams, freeing them to focus on functions that provide direct value to your customers. Find out how in this white paper.

Getting the Most Out of Your Service Mesh

The Aspen Mesh team knows that service mesh has broad implications and benefits whether you're a product owner, a software developer, or an operations leader. Someone in Dev is going to have very different questions than someone in Ops. And an App Owner is going to want to better understand things like a service mesh’s impact on the bottom line.

This guide will help you understand the benefits no matter your role in your organization.

The Complete Guide to Service Mesh

Service meshes are new, extremely powerful and can be complex. If you’ve been asking questions like “What is a service mesh?” “Why would I use one?” “What benefits can it provide?” or “How did people even come up with the idea for service mesh?” then The Complete Guide to Service Mesh is for you.

Check out the free guide to find out.

Service mesh update: Maintainers add features while practitioners push federation (451 Research)

Analysts – Jean Atelsek, William Fellows 

Publication date: Wednesday, December 4 2019 


Cloud adopters are enthusiastic about the promise of service mesh to consistently apply routing, policy and encryption across microservices-based applications, but implementation has been difficult due to fiddly configuration and management demands. Add to this competing control plane options – Istio, Consul, Kuma, Linkerd, NSX and AWS’s proprietary App Mesh – at various stages of adoption and maturity, and you get a perfect storm of confusion; dare we say a bit of a ‘service mess.’ This is to be expected at the current stage of market development. It’s a market that is being made up as we go – it is thrashing, crowded and complex. There’s lots of confusion; clean, simple stories will be successful here. At KubeCon 2019 in San Diego, maintainers introduced tools to make their offerings easier to love, while practitioners cited the need for an open standard that can federate various preferences across environments. 

The 451 Take

Service mesh was a prominent topic at this year’s KubeCon North America, complete with its own Day Zero event (ServiceMeshCon), a CNCF roundtable and a raft of announcements from project maintainers. In a show of hands, about 10% of attendees to the sold-out ServiceMeshCon said they had experience with service mesh in production – about 50% had tried it out. The landscape seems to be branching out in several directions, with open source projects adding tools to ease adoption of their control planes, vendors hoping to capitalize on service mesh difficulty by offering to run it as a service on behalf of enterprises, and other participants promising to reconcile the various offerings with the help of an overarching specification. While few dispute the need for a way to route, monitor and authenticate traffic for service-to-service communications, the way forward for most organizations is far from clear, indicating opportunity as well as risk, although the industry has converged on sidecar proxies (primarily Envoy) as the best available choice for the data plane. 



With the use of a service mesh important to successful microservices implementations, data from 451 Research’s Voice of the Enterprise: DevOps, 2H 2019 survey finds that 13.9% of enterprises are now in production with service mesh, 18.6% have some adoption and about 44% are in planning. 

Please indicate your organization’s adoption status for service mesh 

Source: 451 Research’s Voice of the Enterprise: DevOps, 2H 2019 

Lessons learned

It’s telling that most of the service mesh practitioners speaking at KubeCon were from large, technically sophisticated cloud-native organizations such as Lyft, Uber and Pinterest. Although many vendors are pursuing the opportunity to bridge the world of highly scalable cloud-native environments with on-premises data and legacy applications – a mesh is, after all, only as strong as its weakest link – advice gathered from organizations that have implemented service meshes at scale is instructive. 

  • Collaborate with stakeholders starting early in the process. Tech talks, prototyping and encouraging opt-in by service owners who have the most to gain (e.g., supporting a new language or functionality) will help get the implementation off on the right foot. Proactively identify those most likely to be affected. 
  • Start with an ingress solution. Establish a consistent way for external applications to call into the mesh. Vendors that can deliver an ‘easy button’ on-boarding run book for customers seeking to get started with service mesh will find beginning with ingress to be a useful first step. 
  • Prioritize security for services and for the service mesh itself. A primary use case for service mesh is to ensure mTLS encryption of service-to-service traffic; application and sidecar communications need to be rock solid. With so many layers of software-defined interaction, bugs can arise from many sources. Have a systematic way of testing for and finding the source of problems. 
  • Be careful with migration. Service mesh involves a big change in how services communicate with each other. Planning ahead requires service discovery, service registration and security infrastructure to be in place. 
  • Disable unused components. Some service mesh features can cause problems during implementation even if they’re not active; use the simplest set of tools that can address the problem you’re trying to solve. 
  • Never stop investing in performance improvements. The main downsides of service mesh are latency (multiplied by the number of hops in an application) and resource consumption (multiplied by the number of sidecars); batch chatty connections when possible. 
  • Roll out slowly, start small, scale up. Begin with a use case that’s not in the critical path. As problems are ironed out of initial deployments, iterate quickly and scale to other applications/teams. Doing service mesh for one application or team may mean you can end up with a pet, not cattle. 
  • Plan an update process. Roll out updates slowly, qualifying new releases with critical users first. Allow users to do self-service rollbacks to a specific supported version, and keep track of how many users rolled back a given version to point to widespread difficulties. Fix user issues as soon as possible and ensure that rollbacks are temporary. 
  • Be especially careful with newly opened connections. This is where errors are most likely to be introduced.
  • Keep the faith. Despite the difficulties, service mesh adopters say the benefits make the difficulties worthwhile. 
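The performance point in the list above is simple arithmetic: added latency scales with the number of hops, and resource consumption scales with the number of sidecars. A back-of-the-envelope sketch follows; the per-hop latency and per-sidecar memory figures are hypothetical placeholders, not measurements of any particular mesh.

```python
def mesh_overhead(hops, per_hop_latency_ms, sidecars, per_sidecar_memory_mb):
    """Estimate added request latency and total sidecar memory for a mesh."""
    added_latency_ms = hops * per_hop_latency_ms        # latency is paid on every hop
    total_memory_mb = sidecars * per_sidecar_memory_mb  # memory is paid per sidecar
    return added_latency_ms, total_memory_mb


# Hypothetical inputs: a request crossing 5 hops in a cluster with 200 sidecars.
latency, memory = mesh_overhead(
    hops=5, per_hop_latency_ms=2.0, sidecars=200, per_sidecar_memory_mb=50
)
print(f"added latency: {latency} ms, sidecar memory: {memory} MB")
```

Plugging in measured values for your own environment shows quickly whether batching chatty connections (fewer hops) or trimming unused sidecars matters more.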


Incremental improvements

Some vendors expect the industry to settle on a single standard, as it did with Kubernetes for container orchestration, and Istio has the pole position as a Google-driven project that plays well with Kubernetes. Google’s decision to keep Istio under its own control for now (rather than donating it to the CNCF under an open governance model) worries some potential customers and makes it a nonstarter for others, but many players (including Tetrate, Aspen Mesh, VMware with NSX Service Mesh and IBM with App Connect) are investing in Istio as a foundation for enterprise-grade managed services to support heterogeneous environments. 

Others expect there to remain a variety of service meshes to address a variety of use cases. Given that businesses are already using a variety of control planes in production, Microsoft introduced the Service Mesh Interface (SMI) project in May, a specification for interoperability across different mesh technologies, including Istio, Linkerd and Consul Connect. The project was launched in partnership with Buoyant, HashiCorp, Kinvolk and Weaveworks, with support from Aspen Mesh, Canonical, Docker, Pivotal, Rancher, Red Hat and VMware. The goal of SMI is for developer-friendly APIs to lower the barrier to entry and risk of using a service mesh, collaborate with the service mesh community on customer requests, and create a consistent experience across a new ecosystem with an interoperable, extensible framework. Microsoft provided a demo at ServiceMeshCon, but it won’t see the light of day until 2020.

Among the new projects and features introduced by service mesh maintainers at KubeCon: 

  • A vendor that does not have its own mesh, but offers a Service Mesh Hub dashboard that installs, discovers, manages and groups diverse meshes (including AWS’s App Mesh) together into one big mesh, announced AutoPilot, an operator framework for building workflows on top of service mesh. AutoPilot will help Kubernetes operators to enable mesh metrics and APIs, automated mesh configuration, the ability to expose and invoke webhooks, and out-of-the-box GitOps workflows. The plan is to use telemetry within Kubernetes clusters to drive the behavior of the service mesh for what the vendor calls ‘adaptive service mesh.’
  • Buoyant, maker of Linkerd (one of the few service meshes that doesn’t use the Envoy proxy as a data plane) introduced Dive, a team collaboration tool that captures microservice deployments as events and compiles ownership information and dependencies into a service catalog – ‘like a Facebook for microservices.’ Dive is free and in private beta; there is currently a waitlist for the beta. 
  • Network Service Mesh, a CNCF sandbox project announced in 2018, has attracted 40 contributors and is reportedly receiving interest from financial companies, enterprises and service providers. The project is designed to manage complicated layer 2 and layer 3 use cases in Kubernetes so app service meshes can focus on layer 7 connectivity. 
  • VMware’s NSX Service Mesh is a SaaS offering that runs in public clouds. Based on Istio and Envoy, NSX Service Mesh expands observability and policies to users, data and services, in addition to federation between service mesh clusters. It provides the ability for SecOps and DevOps integrations through policies and tools that allow them to set up application SLOs, access control, encryption and context-based security policies. NSX Service Mesh is built on a global control plane with the agents running on any Kubernetes cluster on any cloud. VMware sees key use cases including application mobility and migration, service mesh HA, E2E encryption for compliance, and visibility for Dev/SecOps.

Service Mesh University

Catch up on all things service mesh in these seven on-demand videos with experts that help you learn more at your own pace. Everything is organized into bite-size sections.

How a Service Mesh Amplifies Business Value

The New Stack Makers Podcast
How a Service Mesh Amplifies Business Value 

In this final episode of The New Stack Makers three-part podcast series featuring Aspen Mesh, Alex Williams, founder and publisher of The New Stack, and correspondent B. Cameron Gain, discuss with invitees how service meshes help DevOps stave off the pain of managing complex cloud native as well as legacy environments and how they can be translated into cost savings. With featured speakers Shawn Wormke, vice president and general manager, Aspen Mesh and Tracy Miranda, director of open source community, CloudBees, they also cover what service meshes can — and cannot — do to help meet business goals and what to expect in the future.

Alex Williams: Hello, welcome to The New Stack Makers, a podcast where we talk about at scale application development, deployment and management. 

Voiceover: Aspen Mesh provides a simpler and more powerful distribution of Istio through a service mesh policy framework, a simpler user experience delivered through the Aspen Mesh UI and a fully supported, tested and hardened distribution of Istio that makes it viable to operate service mesh in the enterprise. 

Alex Williams: Hey, we’re here for another episode of The New Stack Makers, and we are completing our three part series on service mesh and Istio in a discussion with Shawn Wormke, Vice President and General Manager of Aspen Mesh, and Tracy Miranda, Director of Open Source Community at CloudBees, and my good pal and colleague, Bruce Gain, who is co-host today. Bruce is a correspondent with The New Stack. Great to have you all here today. 

Tracy Miranda: Hi, Alex. Thanks for having me. 

Alex Williams: You’re welcome. I just want to start as a note that we’re not talking about the latest machinations with Istio today. We’re focusing on engineering practices. So we are not going to be talking about the Open User Commons [Open Usage Commons] today. There’ll be plenty more discussions on that topic, I’m sure, as time goes on. But for us, our focus is on how do you amplify value with a service mesh? What is it that provides the value in a service mesh architecture? And I think this gets then down to in many ways, that transformation that we’ve seen from monolithic architectures to cloud to now to microservices, whereas in a cloud environment you could be working with a platform as a service environment, you might have multiple APIs and you would have multiple APIs. But now in component based architectures, container technologies, the lifecycle has changed a lot. And that means people have to be aware of a lot more than just a few APIs. Now, it’s a lot of other issues which gets into issues around monitoring, observability, distributed tracing. It goes on and on and on. So both Shawn and Tracy are here to help us with some of the questions that we have. 

Alex Williams: And so I want to just get started with just a little bit of discussion about the developer out there who is spending so much time with maintenance issues such as debugging and refactoring. They spend hours of the week on bad code. It’s such a big issue that we found in some data that it’s nearly an $85 billion worldwide opportunity cost that is lost annually. Now, you can think of opportunity cost as also just the opposite of that, the sunk cost. So you just have to take into consideration what your sunk cost is. But it’s still a huge issue. And so in this area, we want to understand how can a service mesh help increase engineering efficiency to solve these business challenges? And so I wanted to just start it off asking about the developer out there who is building microservices, what’s part of their daily work that’s still quite manual? We have seen a lot about automation and we’re starting to see a lot more automation come into processes for developers. But what is still manual for them to really take care of? I think of things like having to increasingly do configurations in the Kubernetes environment.

Shawn Wormke: Alex, I think that’s a great question. I think we’ve grown a lot as an industry in automating our pipelines and a lot of our testing and deployments and pieces like that. But I think that once applications are out and running and in production, a lot of the manual work comes from the monitoring of those things. And when problems start to happen, how do we efficiently get information out of those applications in a way that helps us understand their behavior in a production and runtime environment? And how do we really get to the root cause of the problem, fix it and ensure that we have a good user experience for our customers? And I think that still a big part of that is done manually. We have lots of tools to gather information and put them into things like Prometheus and OpenTracing and Jaeger and tools like that. But figuring out which things that we need to look at, which events inside of there are causing the problems, correlating those all together, getting those to the teams that can take action on them. All of that is still quite manual in our industry. And that’s where I think a service mesh can really help by consolidating that down into a single place to look and a uniform place for all of the teams to come together and get that data and information from. 

Bruce Cameron Gain: Yeah, so as far as those maintenance issues go, prior to the implementation of service mesh, does the onus of that sometimes fall on the developers now, or is this still an operations problem exclusively? 

Shawn Wormke: That’s a great question, Bruce. I think that traditionally what we saw was that a lot of that work was being done inside of the applications themselves and it was being implemented in a lot of different ways by the development teams. And that’s where that question of uniformity comes from. What we’ve seen with our customers is they want to move that down underneath the application and let the application owners really focus on business value code and let the operations teams, the ops part of the DevOps team, really work on providing them the tooling and the common infrastructure it takes to run those things in production in a large enterprise environment. 

Alex Williams: So, Tracy, you understand software lifecycle management and a question I’ve been asking people lately is about what people really are accustomed to and how that’s changing. And so we are clearly in agreement that container technologies are here to stay. I think we’re in clear agreement that monolithic architectures and microservices environments have practices that are similar. But some things are disappearing. Some things are fading away. How is that affecting the software lifecycle management as we make this transition?

Tracy Miranda: Yeah, I think that’s a great question, and it is a case that so many things are changing, like with the whole onset of containers and microservices, I think we’ve only just started to kind of figure out what that means. I come from it a lot from the continuous delivery perspective and some of the big discussions we’re having there is if you have an app and previously that was a monolith and now it’s a bunch of microservices, how do you even define what the boundaries of that are today? And, you know, how does that influence the way you might deliver different things? And when it comes to service mesh, I think that’s really exciting. I think it’s an area where, again, we’ve just barely scratched the surface of the things we can do with service meshes because they connect all the different services together. And then you can have technology that sits on top of them. Like if you take the example of something like (?), then suddenly you open up this whole world of new things like canarying or monitoring health and kind of this whole bucket of what we sort of term these days, progressive delivery. And this is like super powerful things that you just couldn’t do in pre-containers, pre-distributed systems. So I think it’s really exciting just to watch how people handle it and how they get used to it and then the innovation that’s going to come as a result. 

Bruce Cameron Gain: Would you say that the microservices and containerization, are those really conducive to the service mesh? We’re talking about service meshes and you spoke about the wide variety of environments now that as we move from, say, a legacy system to multi-clouds, et cetera. So I was just wondering, as far as the technology goes, are service meshes really conducive to containerization and microservices. And if so, why and how? 

Tracy Miranda: Yeah, absolutely. I think there is a threshold where it makes sense, depending on the number of services. If you start off with a very simple architecture and you’re not trying to orchestrate too many things, then perhaps a service mesh is just a level of complexity that you don’t need, but it doesn’t take long before you can have a significant system where you want to take advantage of the different capabilities. And if I can talk about where I’d like to see it go, I think there’s some really powerful benefits you could get once you start connecting up all the different services. Like I was talking to the folks on the Jenkins X team and James Rawlings was talking about, once you have a service mesh, you could start to imagine some really clever things. For instance, you can have preview environments in general with CI/CD in Jenkins X: before you commit your code, you can build it and you can run a preview environment so you can see the change you made in practically not quite production, but it looks pretty realistic and it’s a good way to evaluate the patch. Now if you throw in a service mesh, maybe you can start to do something really clever, like shadowing traffic, so you could take some real-world traffic that would go to your production environment, but then you can redirect that to that preview environment and now you’re testing it with some actual data. So it’s starting to become really powerful what you can do. And like I don’t know that you can do this yet. I think it’s a bit theoretical. But I think once people start to appreciate the benefits you get for what seems like a complexity cost at the beginning, and as the tooling becomes easier, I think it will start to become like a no-brainer that you want to have this in your systems.
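The traffic-shadowing idea described here can be sketched in a few lines of Python. This is an illustration of the concept only, not any mesh’s implementation: the function name, the mirror_fraction parameter, and the primary and shadow callables (standing in for the production and preview environments) are all hypothetical.

```python
import random


def route_with_shadow(request, primary, shadow, mirror_fraction=0.1, rng=random.random):
    """Serve the request from `primary`; mirror a fraction to `shadow` and ignore its reply."""
    if rng() < mirror_fraction:
        try:
            shadow(request)  # the preview environment sees real traffic
        except Exception:
            pass             # a failing preview must never affect real users
    return primary(request)  # the user only ever receives the production response
```

The key property is that the shadow copy is fire-and-forget: its response and its failures are discarded, so real users are unaffected while the preview build is exercised with real data. A real proxy would also send the mirrored copy asynchronously so that it adds no latency.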

Bruce Cameron Gain: So we’re still in the early stages. I don’t think a lot of people realize that. 

Tracy Miranda: Yeah, absolutely, I think it’s just people are just getting their heads around it. Why do we need this? And we just need easy ways to get it into folks hands and help them steer clear of the pitfalls so that they can get to, I think, all the real magic you can start to do once you’ve got this orchestration, once you got all these things connected. And then you can start to do pretty clever things. 

Alex Williams: When I hear people talk about scratching the surface on things, it reminds me of just what discovery means and how do you enable discovery. If you don’t have a discovery process, you’ll never know what is unknown to you. And when you start discovering those unknowns, then you start finding more that you did not know before. And maybe we could talk, Shawn, a little bit about how service meshes are architected, for instance, and how the actual work with service mesh architectures help you discover those unknowns. 

Shawn Wormke: Yeah, that’s a great question, Alex, I think to go back a little bit to Tracy’s response around the complexity and sort of when you need a service mesh, I think that’s sort of where it all kind of starts, where customers start to find their unknowns. And we oftentimes talk to our customers about do you really actually need this service mesh at this point in your lifecycle? And so usually what we talk to them about is if you can no longer draw your sort of microservices architecture on a whiteboard or on a piece of paper and be sure that it actually looks like that when it’s deployed into production, it’s probably time to start thinking about a service mesh. And so that’s a piece of those unknowns. And what we see is when we start to deploy these service meshes that provide you the visibility and observability and just the understanding about how service A is talking to service B, oftentimes customers start to recognize that they’re talking to services that they didn’t actually know were in their network. For example, they’re talking to services that are in AWS and they supposedly have a private cloud architecture. And so we start to uncover a lot of things inside of people’s microservice and container architectures that they had no idea were going on.

Shawn Wormke: I think people sort of take for granted the fact that these containers are a unique unit of work and we’re just going to deploy them and let them run. And we don’t have to worry about it because this DevOps team is the one that owns it and manages it, but ultimately in production at large scale and in large corporations, they have a data security policy that they have to follow. They have compliance needs that they need to meet and they need to have things like service mesh running around in there to discover the unknowns, to fix them, to ensure that they can’t talk to the things that they’re not supposed to, then that the things that are talking to other things are who they say they are and that you trust them. So I think that’s a big part of why service mesh architectures will be critical for large scale production deployments in the future. And like Tracy said, we’re just starting to scratch the surface on the use for these things. And it’s wide and almost dependent on the vertical or the industry that you’re deploying them in from a service provider to enterprise, to cloud and cloud native applications. 
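One way to picture “discovering the unknowns” is to compare the calls a mesh actually observes against the architecture drawn on the whiteboard. The sketch below assumes a hypothetical list of observed (source, destination) pairs and a hypothetical map of intended dependencies; real meshes derive the observed side from proxy telemetry.

```python
from collections import defaultdict


def build_call_graph(observed_calls):
    """Group observed (source, destination) pairs into a source -> destinations map."""
    graph = defaultdict(set)
    for source, destination in observed_calls:
        graph[source].add(destination)
    return graph


def unexpected_edges(graph, intended):
    """Return observed (source, destination) pairs that are absent from the intended design."""
    return sorted(
        (src, dst)
        for src, dsts in graph.items()
        for dst in dsts
        if dst not in intended.get(src, set())
    )


# Hypothetical data: the whiteboard says "orders" only talks to "payments".
observed = [("web", "orders"), ("orders", "payments"), ("orders", "s3.amazonaws.com")]
intended = {"web": {"orders"}, "orders": {"payments"}}
print(unexpected_edges(build_call_graph(observed), intended))
# prints [('orders', 's3.amazonaws.com')]
```

The flagged edge is the kind of surprise described above: a service in a supposedly private architecture talking to a public cloud endpoint.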

Bruce Cameron Gain: You touched upon this already a bit, but what are some of the capabilities that they are offering that we can count on or that somebody could say, OK, in addition to their security, of course, and logging capabilities, etc.? What are some examples? 

Shawn Wormke: I think the first and foremost, you know, I think that there’s a lot of old problems that need to be solved in new form factors and new ways. So I think what it boils down to first and foremost, is there’s a bunch of traffic management features that you need in order to deploy things at scale. Right. So taking things from a test to preproduction to production, you know, in your test environments, things are simple. Things are generally running stable. There’s not a lot of traffic happening there and things are fine. But when you get into production, you need traffic management features like basic load balancing between your services that are running around. And that load balancing has to be intelligent and has to understand how those services are responding and making sure that your application is running as efficiently as possible, things like circuit breaking, understanding when a service no longer exists. And rather than waiting for the TCP timeout to happen, with two minutes’ worth of requests going to that thing and going off into a black hole, you know, those things don’t happen when you have a service mesh there. Then we move into security features. Like you said, there’s a lot of encryption features inside of service mesh, people use them for certificate management inside of their container environments, mutual TLS authentication, authorization of applications. But then we move into sort of more of the day two sort of features there. And that’s integration with a lot of their enterprise systems. So most enterprises are complex places that have legacy applications talking to greenfield cloud native applications. They need a way for all of those systems to talk together and service mesh can be that bridge between the two. We see people using that oftentimes even with just certificate management and mTLS and enabling that in their legacy applications and using features of Istio and their certificate management pieces to enable that, all the way down then to, like you said, logging, tracing, visibility features, being able to gather telemetry in a single place consistently across all of your applications provides a huge amount of benefit there as far as architectures go.
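The circuit-breaking behavior mentioned here, failing fast instead of letting requests pile up behind a long timeout, can be sketched as a small Python class. This is a simplified illustration under assumed parameters (failure_threshold, reset_after_s), not the algorithm used by Envoy or Istio.

```python
import time


class CircuitBreaker:
    """Fail fast after repeated errors instead of waiting on a dead service."""

    def __init__(self, failure_threshold=3, reset_after_s=30.0, clock=time.monotonic):
        self.failure_threshold = failure_threshold
        self.reset_after_s = reset_after_s
        self.clock = clock
        self.failures = 0
        self.opened_at = None  # None means the circuit is closed (traffic flows)

    def call(self, func, *args, **kwargs):
        if self.opened_at is not None:
            if self.clock() - self.opened_at < self.reset_after_s:
                raise RuntimeError("circuit open: failing fast")
            self.opened_at = None  # cool-down elapsed: let one trial call through
        try:
            result = func(*args, **kwargs)
        except Exception:
            self.failures += 1
            if self.failures >= self.failure_threshold:
                self.opened_at = self.clock()  # open the circuit
            raise
        self.failures = 0  # any success closes the circuit fully
        return result
```

Once the threshold is reached, callers get an immediate error rather than each waiting for a network timeout, and the failing service gets time to recover before a trial request is allowed through.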

Bruce Cameron Gain: And Tracy, in the big picture sense, how would you say that this integrates with the overall software lifecycle management, a personal interest as well? I would be curious to see how that overlaps also with the developer experience. 

Tracy Miranda: So on the lifecycle side, which I’ll tackle first. I think when it comes to kind of getting your code out into production and I think we’ve touched on this, but let me emphasize it. So this comes down to your deployment methodology and there’s many different deployment methodologies. But ultimately, the one you want to get to in the ideal situation is canary deployments. And that kind of ticks all the boxes in terms of highly available, responsiveness, progressive rollout and ability to roll back. And like, the only way you’re going to get to that is by using a service mesh and taking advantage of the load balancing. So, you know, I think that is where everybody is heading. And as you can build in the necessary infrastructure, that makes all the difference to how you can then get features into the hands of your customers, how you can get that feedback. Is it going well? Is there going to be some problem? Should we dial it back? And can we do that easily and in an automated way without having to suffer a big failure for customers? So I think that when we talk about lifecycles, it’s towards the end of the lifecycle, just getting things into the hands of users.
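A canary deployment combines two pieces: weighted routing between the stable and canary versions, and an automated decision to widen the rollout or dial it back. The Python sketch below shows both under hypothetical thresholds (max_error_rate, step); in a mesh, the routing weights would be applied by the proxies rather than by application code.

```python
import random


def pick_version(canary_weight, rng=random.random):
    """Route one request: the canary receives roughly `canary_weight` of the traffic."""
    return "canary" if rng() < canary_weight else "stable"


def next_canary_weight(current_weight, canary_error_rate, max_error_rate=0.01, step=0.1):
    """Progressive rollout: widen the canary while healthy, roll back to 0 otherwise."""
    if canary_error_rate > max_error_rate:
        return 0.0  # automated rollback: all traffic returns to the stable version
    return min(1.0, round(current_weight + step, 10))
```

Calling next_canary_weight on each evaluation interval answers the questions raised above (“Is it going well? Should we dial it back?”) without a person having to intervene.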

Tracy Miranda: And then I think on the experience side. So it was specifically you asked about developer experience. So I think there were probably I think there’s still a lot of confusion. If I take developer experience as a whole, I think we’re still in the case with I don’t think there’s enough easy ways to do it. I think we’ve got the early adopters who are super good, able to get in, able to deal with different situations and know what they’re doing. But I think there’s still a lot we can do to kind of roll it out for the masses. And I have no doubt all the various communities developing service meshes are going to come up with some things that just make it easier to use, easier to understand when to apply things, how to configure things. And I think that’s the challenge and sticking with some of the complexity, prepackaging things in a way that is easier to get running, but not oversimplifying. 

Bruce Cameron Gain: Tracy, you mentioned you touched upon this a bit, but what kind of learning curve can DevOps teams expect? And does the onus fall on the operations team, the developer teams, security teams, or who? 

Tracy Miranda: Yeah, the reality is, like you take something like Kubernetes and a lot of the teams, like we have a lot more people talking about it today, but I still find the vast majority of folks haven’t even gotten a proper handle on the distributed nature of Kubernetes. And then you start to throw in the rapid change in the cycles of how quickly is this version of Kubernetes I’m using going to be not supported? How quickly do I have to keep up with innovations? And I think it’s just a lot to contend with. And so I think that’s where it folds in kind of the best practices that we have around continuous delivery and software lifecycle and then probably say we’re going back to basics on how teams do that, really looking at the Accelerate book from Nicole Forsgren and Jez Humble those kind of underlying the principles which your team is going to need to adopt any new technology, including service mesh. 

Alex Williams: Back to the basics. Now, Shawn, my question for you is about the open source communities out there who are doing most of the upstream development and often they are so immersed into the actual code and making sure that it works that a developer experience becomes another parallel challenge to manage. How is that parallel challenge getting managed? Because we very well know that the Kubernetes plumbing is pretty much done. You can use Kubernetes. The question now is how do you build on top of it? And we are just starting to see how organizations are building on top of it. I think this speaks to what Tracy was saying, that we are starting to see some deployments, but it’s by no means are we seeing everyone do it across their organizations. So when you’re thinking about that upstream development, what are you thinking about? 

Shawn Wormke: Yeah, so Aspen Mesh is very active in the Istio community, and we have worked very hard to represent our enterprise customers in that community because I think of what you just said Alex, those developers are mired in the code. They are focused on producing the highest quality piece of software that they can. But that doesn’t always translate into a good sort of end user experience and not always into a good, manageable product oftentimes. And so, a big part of what we do is try to represent our enterprise customers and service provider customers, quite frankly, in that environment and making sure that they’re trying to make at least sane choices that don’t put these large deployments in a place that they can’t recover from or that they have an unstable network. And then honestly, I think that’s an opportunity for companies like Aspen Mesh then to build on top of that. A large part of what we do is focus on how to make enterprises successful using that software and dealing with that lifecycle management of the Istio pieces itself. How does this piece of technology work within large organizations? We talk a lot to our customers about that and where it fits into their organizations. Oftentimes we see developers bringing in the technology into the company in a discovery type of mode and sort of proof of concept mode. Eventually, it gets turned over into a platform team, which then works on really helping their developers have access to the pieces that they need and that they understand while the platform team runs the other part of the business. And that’s really what we have focused over the last year or so on, is helping customers integrate that into their organizational structures just as much as we have integrating the technology into their Kubernetes stack. 
And that’s something that I think is oftentimes overlooked in many open source communities, is how this actually fits and how it actually works in a real large customer environment, customer deployment. 

Bruce Cameron Gain: So we have security, observability, traffic management, etc., logging capabilities, whatnot, but among those areas, where do we really need to see improvements in the immediate mid-term, or in all three? 

Shawn Wormke: I think we’ll continue to see improvements in all three. I would say the large majority of our customers come to us for the security aspects first for what’s there. They have some need to have the encryption there. They come to us for that. But I think the long term, the real potential here is around the observability pieces, because a lot of the traffic management features, as I mentioned a little earlier, these are old problems that have been solved before. We just need to sort of repackage them and reformat them. And so I think that those are known problems and it’s relatively straightforward to solve them. I won’t say it’s easy, but I think it’s relatively straightforward to solve them in this new world. But I think the real opportunities here are around observability and helping people understand what’s going on and helping them to really to provide the best user experience they can to their end customers, because I think that’s really where they want to focus is the profit center of the business, not the cost center side. And so reducing the amount of effort it takes for people to find and fix issues, deploy them, ensure that they’re going to work, ensure that they’re going to solve the problems that they were originally trying to fix is a big part of where we can see a lot of improvements in service mesh overall and in the industry in general, I think. 

Bruce Cameron Gain: So the service mesh in many respects is just the starting point. 

Shawn Wormke: Absolutely. I think you can think of it as the tap into the network that pulls all that stuff out. Right. I think the real work and the real opportunity then is on top of that, what you do with that data, how well you organize it, how you get it to the people that need it and how they take action on it and make decisions off of that data. 

Tracy Miranda: What I hope it would enable is just this culture of kind of experimentation where, you know, now you have all these things at your fingertips and you can afford to say I’ve this theory, you’re going to do this. And now we’ve got super fine grained control on traffic management and access that we can afford to see what happens and see how things play out. And that could be the really exciting part. Get to this, a business using service mesh and is taking full advantage of it. 

Alex Williams: So when you’re thinking about taking full advantage of it, one of the most interesting aspects of Kubernetes is how it’s built for a stateless environment. But so much of the work now to make Kubernetes work is to make it work with stateful environments. And so you have a lot of applications out there that need to be thought of in a way that considers issues such as storage, and storage and traditional networking and traditional enterprises are based upon how do you develop architectures that might be 10, 20 years old. Those architectures are monolithic and you just pour the code into them and then you’ve got to figure out how to get them all configured and then you’ve got to get them running and on and on and on. How are you thinking about the stateful applications in the software lifecycle management with service mesh in mind? 

Tracy Miranda: That’s a good question and I’m not sure I have a good answer for that. I think it’s emerging, I gave a talk, copresenting at KubeCon, and we’re looking there at how do you take a monolith and break it up and run it in the cloud with microservices and take advantage of that. But I have to say, even in that talk there’s so much to cover and we don’t even get to kind of aspects of service mesh. So I don’t think it’s that obvious that I certainly don’t have a good answer for that today. 

Alex Williams: Then I guess, Shawn, I’m wondering what’s the use of service mesh then? Because that’s pretty much then the kind of what everyone else is trying to figure out is how to get these stateful applications to work and speaks of why you don’t have adoption in Kubernetes and why the pipes are great. But if no one’s using the pipes, who cares? 

Shawn Wormke: Yeah, I think it’s going to be a big area of focus for many of these technologies in the coming years. I think that this is where Kubernetes and service mesh is still super early. But in Kubernetes this is sort of where the rubber meets the road is how you can actually deploy and how you can put it into these large places that aren’t all greenfield and aren’t cloud-native first. It’s when the large banks and the large airlines and manufacturing companies can start to take this and use their legacy systems with their new things and enable a speed that they haven’t seen before. I think it’s going to be a big challenge is something that we’ve been working with customers with every day to try to help them figure out. And a lot of it is just really understanding the fact that we’re going to have to build things that don’t always take the greenfield first approach. We’re going to have to embrace the fact that there is brownfield out there. We’re going to have to understand that there’s legacy protocols running around, that we’re going to have to support, you know, things like that and that stateful things are not going away any time soon. I mean, if we’re still writing code for banking applications that were written in the 60s, 70s and 80s, I don’t think that stuff’s going away any time soon. So I think just like Tracy said, this is a part where we’re going to have to figure out and we’re going to figure out if we want these types of technologies to be successful for the future, because companies have huge investments in that legacy infrastructure and they need it to bring it forward, whether it’s for failing systems or financials or whatever. It has to come forward for sure. 

Bruce Cameron Gain: Maybe you’re underselling a bit. I mean, my sense is that you are able, as far as the observability goes, are able to somewhat manage or at least observe your legacy storage, for example, your database, particularly your databases. So, you know, right now, how good can it be and how does automation come into play? 

Shawn Wormke: Yeah, I think we can get some amount of observability there. But I think that to get all the benefits out of the service mesh they need to understand those protocols is, I think, where we’re having a little bit of issue there. Right. So if you think about a layer-7 trace compared to something that we could see for a SQL database running over TCP those two things, we’re not going to be able to get you sort of the same level of visibility on those two things. So I think that’s where potentially I’m underselling a little bit. But I also think that the expectation of all these amazing features that service mesh have or what they think about when they put them there, otherwise it just looks sort of like a packet capture to people sometimes. But yeah, I think that there are security pieces that we can do there. For example, we can extend mTLS outside of Kubernetes clusters and outside of the service mesh, for example, there we can do a bunch of things around egress and ingress control for these legacy things on a very granular level that wasn’t available before and things like that that are available and that are there. But again, it’s so early for those things. And a lot of our customers have gone down the path of sort of looking at things in two buckets. We have the new and we have the old. And then they eventually get to the point where like, how do we make these two things work together? And that’s where I think we’re going to be spending a lot of time over the next eighteen to twenty-four months helping them figure that out as we roll these into real environments. Because, you know, we’ve worked with many of these customers and say, oh, we’re all greenfield, greenfield, greenfield, and then we work with them for a couple of months. Then it’s like, oh, but we need to access this database or the storage system over here. And then that’s where this comes from. 

Bruce Cameron Gain: Tracy, would you agree? 

Tracy Miranda: Yeah, I think the whole migration is something the whole industry is kind of struggling with, like I certainly see it as continuous delivery and at the Continuous Delivery Foundation, you know, we have that range of technology, ten year old technology to the brand new. And it’s just a massive divide. So we have both end users who are kind of trying to share their case studies of what they’ve done. And they’re hoping that we can have these conversations and start seeing the patterns that people can use to simplify that. But it’s still I think it’s kind of like the million dollar question at the moment. How do you bridge new to the old and how do you not lose all the investment you have in existing systems? And I don’t think we have good answers today. So it’s something we have to work on as an industry. 

Shawn Wormke: And I think too you can’t overlook the fact that the expertise to do some of these things is a very limited supply. And so to think that all companies are going to have access to the talent that it takes to do some of this is a stretch. Right. And so there has to be a massive amount of learning by kind of all people involved in order to make these kinds of transitions successful, because it’s very hard to hire the right people. It’s very hard to pull people off of their existing jobs, working on their legacy systems to learn the new things. And so there’s a lot that has to happen in the next few years to make this transition successful. 

Bruce Cameron Gain: I’ve actually heard there’s been an emergence of in-house, for example, of the service mesh expert, resident expert, kind of like the Jenkins resident expert. And what I’m wondering, though, is, you know, eventually during the next 18 to 24 months, will the automation aspects of service mesh come into play so that it’ll in many respects, not only will it help to reduce the learning curve, but definitely I would expect it to reduce the amount of operations work. I mean, that’s the whole thing, isn’t it? 

Shawn Wormke: Yeah. It’s interesting you say service mesh expert. I would say it’s experts, plural. We often times see multiple small teams working on this. And oftentimes they’re some of the most valuable talent inside of the organization. So they’re the architects, they’re the senior engineers who are working on this. So that’s one of the things we talk a lot about. And again, another place where commercialisation of some of this does come in and help is that if you can take a team of six or eight people who are managing and running this and help them deal with their lifecycle management, help them deal with upgrades and making sure that they work and all that, you can reduce that team from six or eight people down to one. And that’s a lot better for companies. They can put those seven other brains to work on next generation problems or solving some of these stateless problems or working on retraining other folks inside of the organization. But again, it’s early days for these things. And this is bleeding edge technology. These are early adopters. And so the price that they’re paying for that, whether that’s to commercial vendors or whether it’s for the people it takes in-house to run that, is going to be high. It’ll eventually come down as automation picks up, as productization comes in, as the open source community continues to evolve the product and make things easier for their users, those costs will eventually come down to the end users for sure. 

Bruce Cameron Gain: Tracy, I was wondering if or what was your vision for when and how automation should take over? In many ways that it is just starting to now? 

Tracy Miranda: Yes, just going back to the question you had on developer experience and maybe talking about a couple of things I’m seeing, so one example specifically I’m aware of, I think, is X tries to act as an orchestration tool. So it’s pulling together different tools and trying to simplify the user experience. We don’t need to know everything there is about Kubernetes or their specific distribution in that same way, like today, you can get started with it and you can use it with Istio and with Flagger and automatically get set up with canary deployments with very few commands. And it will set up things in the different ways. And I don’t think it’s the case that you can ignore how it works, like you still need to understand what’s going on under the hood and be able to deal with things. But I think the difference there is that it gets you up and running with something like you can follow this pre-canned example and you can have something running and then you can tweak it. So it’s not like you’re putting together your initial system from a set of parts. I think that’s one thing I’ve seen where we’ll start to have these high level tools which aim to pull things together and aim to make some of the decisions for you in a more opinionated way with this expectation of, OK, I just want to get going first and then I’ll figure out how to tweak it and then I’ll figure out what’s going on let me just take some defaults, all the various decisions I can make and how to set this up. And I think that helps accelerate the starting curve. 

Alex Williams: Great. I think we have time for about two more questions, so I’m going to ask a question and I think Bruce is going to ask one. And I want to help kind of recap kind of what we’ve discussed a little bit here. And we’ve talked a lot about service mesh, the need for service mesh and how service mesh can be helpful in helping us understand those unknown unknowns, as I like to describe them. We’ve talked a little bit about how it fits into software lifecycle management as we’re thinking more about Kubernetes, how is it going to continue to fit into there? And we’ve talked a little bit about the challenges that companies face with brownfield applications and legacy applications and the issues of state and statelessness that are just inherent in complex at-scale architectures. Tracy brought up the million dollar question. And so I want to, like, get an idea of how that million dollar question has been resolved in the past and what can we learn, Tracy, from the evolution of continuous development? For instance, I was at a Kong event yesterday and we talked about continuous integration and how it’s really not as relevant anymore. I think platform as a service is not as relevant anymore. Now we’re talking about container as a service, but what can we learn from continuous delivery and apply to service mesh to help us get the answers to those big questions? 

Tracy Miranda: Yeah, good questions. I think there’s different levels, you can answer that, but I’m going to go back to the open source. I’m a big fan of open source. I’m a big fan of community. I think we do have pretty powerful communities around the technology. And that’s what’s going to make the difference to how do we solve these problems. And then if somebody solves it or comes up with some clever innovation that works for them, how does that then get shared and how do we communicate? Hey, look what I’ve done. Look how I’ve solved this problem. What do you think? Is this good? And it’s just letting people have the freeform innovation, but then coming back and sharing it and then building on it and then taking it into tooling that can be democratized for people to use. And I think that’s just kind of the beauty of open source of the code available, the permissionless nature of it, and just everybody trying to solve things and trying to get to that next level. So I think open source community is just a big part of how we will solve these problems and we’ll do it with the entire ecosystem. You know, the companies involved, the people involved. 

Bruce Cameron Gain: I guess my question then would be, as far as the open source community’s contribution goes, just how critical has it been or crucial has it been? You kind of touched upon this, but at the same time, I’d be curious to know, you know, you mentioned, you know, we are in the early stages, obviously. And, you know, have there been any really pleasant surprises or contributions that really stood out and have an effect to the next year or so?  

Tracy Miranda: I referenced this earlier, the tools like the Flagger tool, which kind of sits on top of service mesh and helps instruct it, I think that’s a perfect example of the kind of innovation functionality that will then help us realize the gains to be had from service mesh technology. 

Bruce Cameron Gain: And Shawn? 

Shawn Wormke: Yeah, I think I agree, and we’re actually big fans of the Flagger tool as well, but I think that the overall ecosystem coming together to solve these problems has been a very interesting thing to watch. But I think the most pleasant surprise for me has been sort of to watch the maturation of the Istio project overall over the last two years and to really see sort of the quality improvements, the scalability improvements all the way down to, you know, now having early disclosure processes for inside of there, which is a recognition that companies are trying to make their living off of it and large companies are deploying it. And they have to have a way to deal with real world enterprise kind of problems. I think that’s been a great thing for me to watch over the years. And I know that my team and my customers are very appreciative of all the work that the community does, and we love being partners with them and being part of that ecosystem overall. So I think that’s just been a great couple of years for us. 

Alex Williams: Well, let’s hope it’s another great couple of years ahead. We talked a lot about service mesh and I get the sense that a lot of people are still learning quite a bit. And I think that goes into the actual early adopters themselves. And in this series that we’ve had with the Aspen Mesh team and others in the industry, that’s really quite apparent. And so we look forward to understanding how the community is going to start working through these issues more. But I think it also speaks to the larger Kubernetes community and how they’re working through issues as they start to build more on top of this Kubernetes architecture. So I want to thank you all for participating today. Shawn Wormke of Aspen Mesh, Tracy Miranda from CloudBees, thank you so much for joining us. And Bruce Gain. Good to see you here today. Thank you very much for your time. 

Shawn Wormke: Thanks, Alex.

Tracy Miranda: Thanks for having me. This was great. 

Voiceover: Aspen Mesh provides a simpler and more powerful distribution of Istio through a service mesh policy framework, a simpler user experience delivered through the Aspen Mesh UI and a fully supported, tested and hardened distribution of Istio that makes it viable to operate service mesh in the enterprise. 

Alex Williams: Listen to more episodes of The New Stack Makers. Please rate and review us on iTunes, like us on YouTube and follow us on SoundCloud. Thanks for listening and see you next time. 

Sailing Faster with Istio

While the extraordinarily large container ship Ever Given ran aground in the Suez Canal, halting a major trade route and causing losses in the billions, our solution engineers at Aspen Mesh have been stuck diagnosing a tricky Istio and Envoy performance bottleneck on their own island for the past few weeks. Though the scale and global impact of these two problems are quite different, it has presented an interesting way to correlate a global shipping event with the metaphorical nautical themes used by Istio. To elaborate on this theme, let’s switch from containers carrying dairy, and apparently everything else under the sun, to containers shuttling network packets.

To unlock the most from containers and microservices architecture, Istio (and Aspen Mesh) uses a sidecar proxy model. Adding sidecar proxies into your mesh provides a host of benefits, from uniform identity to security to metrics and advanced traffic routing. As Aspen Mesh customers range from large enterprises all the way to service providers, the performance impact of adding these sidecars is as important to us as the benefits outlined above. The performance experiment that I’m going to cover in this blog is geared toward evaluating the impact of adding sidecar proxies in high-throughput scenarios on the server side, the client side, or both.

We have encountered workloads, especially in the service provider space, where there are high requests or transactions-per-second requirements for a particular service. Also, scaling up — i.e., adding more CPU/memory — is preferable to scaling out. We wanted to test the limits of sidecar proxies with regards to the maximum achievable throughput so that we can tune and optimize our model to meet the performance requirements of the wide variety of workloads used by our customers.

Throughput Test Setup

The test setup we used for this experiment was rather simple: a Fortio client and server running on Kubernetes on large AWS node instance types like burstable t3.2xlarge with 8 vCPUs and 32 GB of memory or dedicated m5.8xlarge instance types which have 32 vCPUs and 128 GB of memory. The test was running a single instance of the Fortio client and server pod with no resource constraints on their own dedicated nodes. The Fortio client was run in a mode to maximize throughput like this:

The test runs for 60 seconds with queries per second (QPS) set to 0 (i.e., maximum throughput) and a varying number of simultaneous parallel connections. With this setup on a t3.2xlarge machine, we were able to achieve around 100,000 QPS. Further increasing the number of parallel connections didn’t result in throughput beyond ~100K QPS, signaling a possible CPU bottleneck. Running the same experiment on an m5.8xlarge instance, we could achieve much higher throughput, around 300,000 QPS or higher depending upon the parallel connection settings.

This was sufficient proof of CPU throttling. As adding more CPUs increased the QPS, we felt that we had a reasonable baseline to start evaluating the effects of adding sidecar proxies in this setup.

Adding Sidecar Proxies on Both Ends

Next, with the same setup on t3.2xlarge instances, we added Istio sidecar proxies on both Fortio client and server pods with Aspen Mesh default settings: mTLS STRICT mode, access logging enabled and the default concurrency (worker threads) of 2. With these parameters, and running the same command as before, we could only get a maximum throughput of ~10,000 QPS.

This is a factor of 10 reduction in throughput. This was expected as we had only configured two worker threads, which were hopefully running at their maximum capacity but could not keep up with client load.

So, the logical next step for us was to increase the concurrency setting to run more worker threads to accept more connections and achieve higher throughput. In Istio and Aspen Mesh, you can set the proxy concurrency globally via the concurrency setting in proxy config under mesh config or override them via pod annotations like this:
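As a hedged sketch of the per-pod override, the snippet below builds a merge patch that adds the proxy.istio.io/config annotation, with a concurrency value, to a workload's pod template. The choice of a Deployment-style pod template and the value 4 are assumptions made for the example.

```python
import json


def concurrency_patch(workers):
    """Build a merge patch that sets the sidecar worker-thread count.

    The annotation value is a small ProxyConfig document; only the
    concurrency field is set here.
    """
    proxy_config = f"concurrency: {workers}\n"
    return {
        "spec": {
            "template": {
                "metadata": {
                    "annotations": {"proxy.istio.io/config": proxy_config}
                }
            }
        }
    }


# Example: request four worker threads for the sidecar of one workload.
print(json.dumps(concurrency_patch(4), indent=2))
```

The printed JSON could be applied with a merge-type kubectl patch against the workload in question. Changing a pod template annotation triggers a rollout, so the new setting applies to freshly created pods.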

Note that using the value “0” for concurrency configures it to use all the available cores on the machine. We increased the concurrency setting from two to four to six and saw a steady increase in maximum throughput from 10K QPS to ~15K QPS to ~20K QPS as expected. However, these numbers were still quite low (by a factor of five) as compared to the results with no sidecar proxies.
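To make the scaling trend concrete, here is a small worked calculation using the approximate figures quoted above (the ~100K QPS baseline on t3.2xlarge, and ~10K, ~15K and ~20K QPS at concurrency 2, 4 and 6). The numbers are rounded, so treat the output as indicative only.

```python
BASELINE_QPS = 100_000  # approximate t3.2xlarge result without sidecars
SIDECAR_QPS = {2: 10_000, 4: 15_000, 6: 20_000}  # concurrency -> approx. max QPS


def summarize(baseline, results):
    """Compute slowdown versus baseline and throughput per worker thread."""
    rows = []
    for workers, qps in sorted(results.items()):
        rows.append({
            "workers": workers,
            "qps": qps,
            "slowdown": baseline / qps,
            "qps_per_worker": qps / workers,
        })
    return rows


for row in summarize(BASELINE_QPS, SIDECAR_QPS):
    print(
        f"{row['workers']} workers: {row['qps']} QPS, "
        f"{row['slowdown']:.1f}x below baseline, "
        f"{row['qps_per_worker']:.0f} QPS per worker"
    )
```

Throughput per worker falls from about 5,000 to 3,750 to roughly 3,333 QPS as threads are added, so each extra worker contributes less than the one before it.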

To eliminate the CPU throttling factor, we ran the same experiment on m5.8xlarge instances with even higher concurrency settings but the maximum throughput we could achieve was still around ~20,000 QPS.

This degradation was far from acceptable, so we dug into why the throughput was low even with sufficient worker threads configured on the sidecar proxies.

Peeling the Onion

To investigate this issue, we looked at the CPU utilization metrics in the server pod and noticed that the CPU utilization as a percentage of total requested CPUs was not very high. This seemed odd as we expected the proxy worker threads to be spinning as fast as possible to achieve the maximum throughput, so we needed to investigate further to understand the root cause.

To get a better understanding of low CPU utilization, we inspected the connections received by the server sidecar proxy. Envoy’s concurrency model relies on the kernel to distribute connections between the different worker threads listening on the same socket. This means that if the number of connections received at the server sidecar proxy is less than the number of worker threads, you can never fully use all CPUs.
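The bound implied by this model can be written down directly. The sketch below is a simplification that assumes each connection stays on the worker thread that accepted it, so at most one worker per connection can have work to do; the function name is ours.

```python
def max_busy_fraction(connections, workers):
    """Upper bound on the fraction of worker threads that can be busy.

    Assumes a connection is handled by a single worker for its lifetime,
    so idle workers cannot take over load from busy ones.
    """
    if workers <= 0:
        raise ValueError("workers must be positive")
    if connections < 0:
        raise ValueError("connections must not be negative")
    return min(connections, workers) / workers


# With 4 connections and 8 workers, at most half of the workers can be busy.
print(max_busy_fraction(4, 8))
# With many more connections than workers, the bound no longer limits anything.
print(max_busy_fraction(64, 8))
```

This is only an upper bound. Having more connections than workers is necessary for full utilization, but it does not guarantee that the connections are spread evenly.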

As this investigation was purely on the server-side, we ran the above experiment again with the Fortio client pod, but this time without the sidecar proxy injected and only the Fortio server pod with the proxy injected. We found that the maximum throughput was still limited to around ~20K QPS as before, thereby hinting at issues on the server sidecar proxy.

To investigate further, we had to look at connection level metrics reported by Envoy proxy. Later in this article, we’ll see what happens to this experiment with Envoy metrics exposed. (By default, Istio and Aspen Mesh don’t expose the connection-level metrics from Envoy.)

These metrics can be enabled in Istio version 1.8 and above by following this guide and adding the appropriate pod annotations corresponding to the metrics you want to be exposed. Envoy has many low-level metrics emitted at high resolution that can easily overwhelm your metrics backend for a moderately sized cluster, so you should enable this cautiously in production environments.

Additionally, it can be quite a journey to find the right Envoy metrics to enable, so here’s what you will need to get connection-level metrics. On the server-side pod, add the following annotation:

This will enable reporting for all listeners configured by Istio, which can be a lot depending upon the number of services in your cluster, but only enable the downstream connections total counter and downstream connections active gauge metrics.
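To show the effect of such an inclusion filter, the sketch below applies a hypothetical regex to a handful of illustrative stat names. The pattern, the placeholder listener address ADDR, and the use of full-string matching are all assumptions made for the example, so check them against your own configuration.

```python
import re

# Hypothetical inclusion pattern: connection total counter and active gauge only.
PATTERN = re.compile(r"listener\..*\.downstream_cx_(total|active)")

# Illustrative stat names; ADDR stands in for a real listener address.
CANDIDATES = [
    "listener.ADDR.downstream_cx_total",
    "listener.ADDR.downstream_cx_active",
    "listener.ADDR.worker_3.downstream_cx_active",
    "listener.ADDR.downstream_cx_destroy",
    "cluster.inbound.upstream_cx_total",
]


def exposed(names, pattern=PATTERN):
    """Return the stat names that the inclusion pattern would let through."""
    return [name for name in names if pattern.fullmatch(name)]


for name in exposed(CANDIDATES):
    print(name)
```

Under these assumptions, the listener-level counter and gauge pass the filter, and so does the per-worker variant, while other listener and cluster stats stay hidden.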

To look at these metrics, you can use your Prometheus dashboard, if it’s enabled, or port-forward to the server pod under test to port 15000 and navigate to http://localhost:15000/stats/prometheus. As there are many listeners configured by Istio, it can be tricky to find the correct one. Here’s a quick primer on how Istio sets up Envoy configuration. (You can find the complete list of Envoy listener metrics here.)
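As a minimal sketch of the port-forward route, the snippet below fetches the stats page from the URL above and keeps only the sample lines for the two connection metrics. It assumes the port-forward is already running; the substring filter is a simplification that returns every listener's lines rather than picking out one.

```python
from urllib.request import urlopen

# Reachable only while a port-forward to the pod under test is active.
STATS_URL = "http://localhost:15000/stats/prometheus"


def fetch_stats(url=STATS_URL, timeout=5):
    """Download the Prometheus-format stats page as text."""
    with urlopen(url, timeout=timeout) as response:
        return response.read().decode("utf-8")


def connection_metrics(text):
    """Keep sample lines for the downstream connection counter and gauge."""
    wanted = ("downstream_cx_total", "downstream_cx_active")
    return [
        line
        for line in text.splitlines()
        if line and not line.startswith("#") and any(w in line for w in wanted)
    ]
```

A typical use would be to loop over `connection_metrics(fetch_stats())` and print each line, then narrow the output down to the listener of interest.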

For any inbound connections to a pod from clients outside of the pod, Istio configures a virtual inbound listener, which receives all the traffic from iptables’ redirect rules. This is the only listener that’s actually configured to receive connections from the kernel, and after the connection is received, it is matched against filter chain attributes to proxy the traffic to the correct application port on localhost. This means that even though the Fortio client above is targeting port 8080, we need to look at the total and active connections for the virtual inbound listener instead of a listener on port 8080. Looking at this metric, we found that the number of active connections was close to the configured number of simultaneous connections on the Fortio client side. This invalidated our theory about the number of connections being less than worker threads.

The next step in our debugging journey was to look at the number of connections received on each worker thread. As I had alluded to earlier, Envoy relies on the kernel to distribute the accepted connections to different worker threads, and for all the worker threads to be fully utilizing the allotted CPUs, the connections also need to be fairly balanced. Luckily, Envoy has per-worker metrics for listeners that can be enabled to understand the distribution. Since these metrics are rooted at listener.<address>.<handler>.<metric name>, the regex provided in the annotation above should also expose these metrics. The per-worker metrics looked like this:

As you can see from the above image, the connections were far from evenly distributed among the worker threads. One thread, worker 10, had 11.5K active connections, compared to some threads that had around 1-1.5K active connections and others that had even fewer. This explains the low CPU utilization numbers: most of the worker threads just didn’t have enough connections to do useful work.
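If you would rather not eyeball a dashboard, a rough way to quantify this kind of skew is to parse the per-worker gauges out of a plain-text stats dump. The following Python sketch assumes the `name: value` text format and the `listener.<address>.worker_<n>.<metric>` naming. The sample values are made up to echo the shape of the skew described above and are not our measured data.

```python
import re
from collections import defaultdict

# Made-up sample in "name: value" form; NOT measured data.
SAMPLE_STATS = """\
listener.0.0.0.0_15006.downstream_cx_active: 16000
listener.0.0.0.0_15006.worker_0.downstream_cx_active: 1500
listener.0.0.0.0_15006.worker_1.downstream_cx_active: 1000
listener.0.0.0.0_15006.worker_2.downstream_cx_active: 2000
listener.0.0.0.0_15006.worker_10.downstream_cx_active: 11500
"""

WORKER_STAT = re.compile(
    r"^listener\.(?P<listener>.+)\.worker_(?P<worker>\d+)"
    r"\.downstream_cx_active: (?P<value>\d+)$"
)


def per_worker_active(stats_text):
    """Map listener address -> {worker index -> active connections}."""
    result = defaultdict(dict)
    for line in stats_text.splitlines():
        match = WORKER_STAT.match(line.strip())
        if match:
            worker = int(match["worker"])
            result[match["listener"]][worker] = int(match["value"])
    return dict(result)


def imbalance_ratio(worker_counts):
    """Busiest worker's connection count divided by the per-worker mean."""
    mean = sum(worker_counts.values()) / len(worker_counts)
    return max(worker_counts.values()) / mean if mean else 0.0


counts = per_worker_active(SAMPLE_STATS)["0.0.0.0_15006"]
busiest = max(counts, key=counts.get)
ratio = imbalance_ratio(counts)
```

A ratio close to 1 means the workers share the load evenly; the further above 1 it climbs, the more one thread is doing the work of many.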

In our Envoy research, we quickly stumbled upon this issue, which very nicely sums up the problem and the various efforts that have been made to fix it.

So, next, we went looking for a solution to fix this problem. It seemed like, for the moment, our own Ever Given was stuck as some diligent worker threads struggled to find balance. We needed an excavator to start digging.

While our intrepid team tackled the problem of scaling high-throughput workloads with sidecar proxies added, we encountered a bottleneck not entirely unlike the one the Ever Given ran into not long ago in the Suez Canal.

Luckily, we had a few more things to try, and we were ready to take a closer look at the listener metrics.

Let There Be Equality Among Threads!

After parsing through the conversations in the issue, we found the pull request that enabled a configuration option to turn on a feature to achieve better balancing across worker threads. At this point, trying this out seemed worthwhile, so we looked at how to enable this in Istio. (Note that as part of this PR, the per-worker thread metrics were added, which was useful in diagnosing this problem.)

For all the ignoble things EnvoyFilter can do in Istio, it’s useful in situations like these to quickly try out new Envoy configuration knobs without making code changes in “istiod” or the control plane. To turn the “exact balance” feature on, we created an EnvoyFilter resource like this:

With this configuration applied and with bated breath, we ran the experiment again and looked at the per-worker thread metrics. Voila! Look at the perfectly balanced connections in the image below:

Measuring the throughput with this configuration set, we could achieve around 80,000 QPS, which is a significant improvement over the earlier results. Looking at CPU utilization, we saw that all the CPUs were pegged at or near 100%. This meant that we were finally seeing CPU throttling. At this point, by adding more CPUs and a bigger machine, we could achieve much higher numbers, as expected. So far so good.

As you may recall, this experiment was purely to test the effects of server sidecar proxy, so we removed the client sidecar proxy for these tests. It was now time to measure performance with both sidecars added.

Measuring the Impacts of a Client Sidecar Proxy

With this exact balancing configuration enabled on the inbound port (server side only), we ran the experiment with sidecars on both ends. We were hoping to achieve high throughput that would be limited only by the number of CPUs dedicated to Envoy worker threads. If only things were that simple.

We found that the maximum throughput was once again capped at around 20K QPS.

A bit disappointing, but since we then knew about the issue of connection imbalance on the server side, we reasoned that the same could happen on the client side between the application and the sidecar proxy container on localhost. First, we enabled the following metrics on the client-side proxy:

In addition to the listener metrics, we also enabled cluster-level metrics, which emit total and active connections for any upstream cluster. We wanted to verify that the client sidecar proxy was sending a sufficient number of connections to the upstream Fortio server cluster to keep the server worker threads occupied. We found that the number of active connections mirrored the number of connections used by the Fortio client in our command. This was a good sign. Note that Envoy doesn’t report cluster-level metrics at the per-worker level; they are aggregated across workers, so there’s no way for us to know how the connections were distributed on the outbound side.

Next, we inspected the listener connection statistics on the client side, as we had on the server side, to ensure that we were not having connection imbalance issues. The outbound listeners, or the listeners set up to handle traffic originating from the application in the same pod as the sidecar proxy, are set up a bit differently in Istio than the inbound side. For outbound traffic, a virtual outbound listener on port 15001 is created, similar to the virtual inbound listener, and it is the target for the iptables redirect rules. Unlike the inbound side, the virtual listener hands off the connection to a more specific listener, such as the one for port 8080, based on the original destination address. If there are no specific matches, then the listener configuration in the virtual outbound listener takes effect. This can block or allow all traffic depending on your configured outbound traffic policy. In the traffic flow from the Fortio client to server, we expected the listener for port 8080 to be handling connections on the client-side proxy, so we inspected the connection metrics at this listener. The listener metrics looked like this:

The above image shows the same connection imbalance between worker threads that we saw on the server side. On the outbound client-side proxy, however, the connections were being handled by only one worker thread, which explains the poor QPS numbers. Having fixed this on the server side, we applied a similar EnvoyFilter configuration, with minor tweaks for context and port, to address this imbalance:

Surely, applying this resource would fix our issue and we would be able to achieve high QPS with both client and server sidecar proxies with sufficient CPUs allocated to them. Well, we ran the experiment again and saw no difference in the throughput numbers. Checking the listener metrics again, we saw that even with this EnvoyFilter resource applied, only one worker thread was handling all the connections. We also tried applying the exact balance config on both virtual outbound port 15001 and outbound port 8080, but the throughput was still limited to 20K QPS.

This warranted the next round of investigations.

Original Destination Listeners, Exact Balance Issues

We went around looking in Envoy code and opened GitHub issues to understand why the client-side exact balance configuration was not taking effect, while the server side was working wonders. The key difference between the two listeners, other than the directionality, was that the virtual outbound listener on port 15001 was an original destination listener, which hands over connections to other listeners matched on the original destination address. With help from the Istio community (thanks, Yuchen Dai from Google), we found this open issue, which explains this behavior in a rather cryptic way.

Basically, the current exact balance implementation relies on connection counters per worker thread to fix the imbalance. When the original destination is enabled on the virtual outbound listener, the connection counter on the worker thread is incremented when a connection is received, but as the connection is immediately handed to the more specific listener, like the one for port 8080, it is decremented again. This quick increase and decrease in the internal count fools the exact balancer into thinking the balance is perfect, as all these counters are always at zero. It also appears that applying exact balance on the listener that handles the connection (the port 8080 listener in this case) but doesn’t accept it from the kernel has no effect, due to current implementation limitations.
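The counter behavior described above can be illustrated with a small Python model. This is a deliberately simplified sketch, not Envoy's actual implementation: the `Worker` class, the lowest-index tie-breaking rule, and the assumption that connections are long-lived and never close are all ours.

```python
class Worker:
    """Hypothetical stand-in for one worker thread's handler on a listener."""

    def __init__(self, index):
        self.index = index
        self.tracked = 0  # counter the balancer consults
        self.handled = 0  # connections this worker actually ends up serving


def pick_target(workers):
    """Exact-balance style choice: the worker with the lowest counter.

    Assumption: ties go to the lowest index (min() keeps the first minimum).
    """
    return min(workers, key=lambda worker: worker.tracked)


def accept(workers, hands_off):
    """Place one new long-lived connection on a worker."""
    target = pick_target(workers)
    target.tracked += 1  # connection received by the balanced listener
    target.handled += 1
    if hands_off:
        # Original-destination hand-off: a more specific listener takes the
        # connection over, so the balanced listener's counter drops right back.
        target.tracked -= 1


def simulate(num_workers, num_connections, hands_off):
    """Return how many connections each worker ends up serving."""
    workers = [Worker(i) for i in range(num_workers)]
    for _ in range(num_connections):
        accept(workers, hands_off)
    return [worker.handled for worker in workers]


balanced = simulate(4, 1000, hands_off=False)
skewed = simulate(4, 1000, hands_off=True)
```

Without the hand-off, the counters keep growing and the model spreads connections evenly. With the hand-off, every counter reads zero at decision time, so the balancer sees a permanent tie and, under our tie-breaking assumption, keeps choosing the same worker.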

Fortunately, a fix for this issue is in progress, and we’ll be working with the community to get it addressed as quickly as possible. In the meantime, if you’re getting hit by these performance issues on the client side, scaling out with a lower concurrency setting is a better way to reach higher QPS than scaling up with higher concurrency and more worker threads. We are also working with the Istio community to provide configuration knobs for enabling exact balance in Envoy, and optionally for switching the default settings, so that everyone can benefit from our findings.

Working on this performance analysis was interesting and a challenge in its own way, like the small tractor next to the giant ship trying to make it move.

Well, maybe not exactly, but it was a learning experience for me and my team, and I’m glad we are able to share our learnings with the rest of the community, as this aspect of Istio is often overlooked by the broader vendor ecosystem. In future blogs, we will run and publish performance numbers on the impact of turning on various features such as mTLS, access logging and tracing in high-throughput scenarios, so if you’re interested in this topic, subscribe to our blog to get updates or reach out to us with any questions.

Thank you Aspen Mesh team members Pawel and Bart who patiently and diligently ran various test scenarios, collected data and were uncompromising in their pursuit to get the last bit out of Istio and Aspen Mesh. It’s not surprising. After all, being part of F5, taking performance seriously is just part of our DNA. 

Improving Your Application with Service Mesh

Engineering + Technology = Uptime 

Have you come across the term “application value” lately? Software-first organizations are using it as a new form of currency. Businesses delivering a product or service to their customers through an application understand the growing importance of their application’s security, reliability and feature velocity. And, as the applications that people use become increasingly important to enterprises, so do engineering teams and the right tools.

The Right People for the Job: Efficient Engineering Teams 

Access to engineering talent is now more important to some companies than access to capital. 61% of executives consider this a potential threat to their business. With the average developer spending more than 17 hours each week dealing with maintenance issues, such as debugging and refactoring, plus approximately four hours a week on “bad code” (representing nearly $85 billion worldwide in lost opportunity cost annually), the necessity of driving business value with applications increases. And who can help to solve these puzzles? The right engineering team, in combination with the right technologies and tools. For the piece of the puzzle that can be solved by your engineering team, enterprises have two options as customer demands on applications increase:

  1. Increase the size and cost of engineering teams, or  
  2. Increase your engineering efficiency.  

Couple the need to increase the efficiency of your engineering team with the challenges around growing revenue in increasingly competitive and low-margin businesses, and the importance of driving value through applications is top of mind for any business. One way to help make your team more efficient is by providing the right technologies and tools.

The Right Technology for the Job: Microservices and Service Mesh 

Using microservices architectures allows enterprises to more quickly deliver new features to customers, keeping them happy and providing them with more value over time. In addition, with microservices, businesses can more easily keep pace with the competition in their space through better application scalability, resiliency and agility. Of course, as with any shift in technology, there can be new challenges.

One challenge our customers sometimes face is difficulty with debugging or resolving problems within these microservices environments. It can be challenging to fix issues fast, especially when there are cascading failures that can cause your users to have a bad experience on your application. That’s where a service mesh can help.

Service mesh provides ways to see, identify, trace and log errors when they occur and to pinpoint their sources. It brings all of your data together into a single source of truth, removing error-prone processes and enabling you to get fast, reliable information about downtime, failures and outages. More uptime means happy users and more revenue, and the agility with stability that you need for a competitive edge.

Increasing Your Application Value  

Service mesh allows engineering teams to address many issues, but especially these three critical areas: 

  • Proactive issue detection, quick incident response, and workflows that accelerate fixing issues 
  • A unified source of multi-dimensional insights into application and infrastructure health and performance that provides context about the entire software system 
  • Line of sight into weak points in environments, enabling engineering teams to build more resilient systems in the future  

If you or your team are running Kubernetes-based applications at scale and are seeing the advantages, but know you can get more value out of them by increasing your engineering efficiency and uptime for your application’s users, it’s probably time to check out a service mesh.

How Delphi Simplifies Kubernetes Security with Aspen Mesh

Delphi's Mission

Delphi delivers software solutions that help professional liability insurers streamline their operations and optimize their business processes. Leveraging a highly flexible technology platform, Delphi enables companies to reduce costs, increase operational efficiency, and improve business intelligence. The Delphi Digital Platform is a cloud-based software solution that connects customers, agents, employees, and third parties to Delphi’s core transactional systems and other solutions in the digital ecosystem. This provides professional liability insurance carriers with modern microservice-based software solutions, giving them: 

  • The ability to link their business directly to their customers’ needs 
  • The flexibility to quickly respond to changing market conditions 
  • A cloud platform providing an environment for acquisition integration 

Delphi's Technology Stack

The infrastructure team at Delphi has fully embraced a cloud-native stack to deliver the Delphi Digital Platform to its customers. The team leverages Kubernetes to effectively manage builds and deploys. Delphi planned to use Kubernetes from the start, but was looking for a simpler security solution for their infrastructure that could be managed without implementations in each service. 

The Challenge

Delphi operates in the highly regulated healthcare industry, where privacy and compliance requirements such as HIPAA and APRA mandate a highly secure environment. A zero-trust environment is of utmost importance for Delphi and their customers. Delphi was getting tremendous value from Kubernetes but needed to find an easier way to bake security into the infrastructure. Taking advantage of a service mesh was the obvious solution to address this challenge, as it provides cluster-wide mTLS encryption. The team chose Istio to solve this problem. The initial solution included setting up a certificate at the load balancer, but this left unencrypted HTTP between the load balancer and the service. Unfortunately, this was not acceptable in a highly regulated healthcare industry with strict requirements to keep personal data secure.

“At this point, I look at Aspen Mesh as an extension of my team”
- Bill Reeder, Delphi Technology Lead Architect 

The Solution

With the final solution in sight, Delphi engaged with Aspen Mesh to implement an end-to-end encrypted solution, from client to back-end SaaS applications. This was achieved by enabling mTLS mesh-wide from service to service and creating custom Istio policy manifests to integrate cert-manager and Let’s Encrypt for client-side encryption. As a result, Delphi is able to provide secure ingress integration for a multitenant B2C environment. This approach forwards encrypted AWS Elastic Load Balancer traffic to the Istio Ingress Gateway for TLS connection termination. The solution utilizes DNS resolution via Route53 to allow Let’s Encrypt to validate the Certificate Signing Request and issue a certificate to cert-manager. The traffic then traverses one gateway resource per tenant (isolated client hosts), where each gateway contains its own certificate. This solution allows Delphi to deploy its own private key and certificate whenever a new tenant is created in the mesh, generating a fully scalable solution where cert-manager and Let’s Encrypt provide the certs or keys as desired.

The Impact

The Aspen Mesh solution lets Delphi use Let’s Encrypt seamlessly with Istio. This has removed the need to consider building security into application development and placed it into an infrastructure solution that is highly scalable. Leveraging the power of Kubernetes, Istio and Aspen Mesh, the Delphi team is delivering a highly secure platform to their customers without the need to implement encryption in each service.