
How Delphi Simplifies Kubernetes Security with Aspen Mesh

Delphi's Mission

Delphi delivers software solutions that help professional liability insurers streamline their operations and optimize their business processes. Leveraging a highly flexible technology platform, Delphi enables companies to reduce costs, increase operational efficiency, and improve business intelligence. The Delphi Digital Platform is a cloud-based software solution that connects customers, agents, employees, and third parties to Delphi’s core transactional systems and other solutions in the digital ecosystem. This provides professional liability insurance carriers with modern microservice-based software solutions, giving them: 

  • The ability to link their business directly to their customers’ needs 
  • The flexibility to quickly respond to changing market conditions 
  • A cloud platform providing an environment for acquisition integration 

Delphi's Technology Stack

The infrastructure team at Delphi has fully embraced a cloud-native stack to deliver the Delphi Digital Platform to its customers. The team leverages Kubernetes to effectively manage builds and deploys. Delphi planned to use Kubernetes from the start, but was looking for a simpler security solution for their infrastructure that could be managed without implementations in each service. 

The Challenge

Operating in the highly regulated healthcare industry, Delphi and its customers face privacy and compliance mandates such as HIPAA and APRA that require a highly secure environment, so a zero trust environment is of utmost importance. Delphi was getting tremendous value from Kubernetes but needed an easier way to bake security into the infrastructure. A service mesh was the obvious way to address this challenge, as it provides cluster-wide mTLS encryption, and the team chose Istio to solve the problem. The initial solution terminated a certificate at the load balancer, but that left plain HTTP between the load balancer and the service. This was not acceptable in a highly regulated healthcare industry with strict requirements to keep personal data secure.


“At this point, I look at Aspen Mesh as an extension of my team”
- Bill Reeder, Delphi Technology Lead Architect 

The Solution

With the final solution in sight, Delphi engaged Aspen Mesh to implement an end-to-end encrypted solution, from client to back-end SaaS applications. This was achieved by enabling mTLS mesh-wide from service to service and creating custom Istio policy manifests that integrate cert-manager and Let's Encrypt for client-side encryption. As a result, Delphi is able to provide secure ingress integration for a multitenant B2C environment. This approach forwards encrypted AWS Elastic Load Balancer traffic to the Istio Ingress Gateway for TLS connection termination. The solution uses DNS resolution via Route 53 so that Let's Encrypt can validate the Certificate Signing Request and issue a certificate to cert-manager. The traffic then traverses one Gateway resource per tenant (isolated client hosts), where each Gateway contains its own certificate. This allows Delphi to deploy a private key and certificate whenever a new tenant is created in the mesh, yielding a fully scalable solution in which cert-manager and Let's Encrypt provide the certs and keys as desired. A rough sketch of this per-tenant pattern is shown below.
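As a rough sketch of the per-tenant pattern described above (the names, namespace and hostnames are illustrative, not Delphi's actual manifests), a cert-manager Certificate can request a Let's Encrypt certificate for a tenant host, and an Istio Gateway dedicated to that tenant can terminate TLS with it:

# Hypothetical per-tenant certificate; assumes a ClusterIssuer using Route 53 DNS-01 validation
apiVersion: cert-manager.io/v1
kind: Certificate
metadata:
  name: tenant-a-cert
  namespace: istio-system
spec:
  secretName: tenant-a-cert            # consumed by the Gateway below
  dnsNames:
    - tenant-a.example.com
  issuerRef:
    name: letsencrypt-prod
    kind: ClusterIssuer
---
# Hypothetical per-tenant gateway terminating TLS at the Istio ingress gateway
apiVersion: networking.istio.io/v1beta1
kind: Gateway
metadata:
  name: tenant-a-gateway
  namespace: istio-system
spec:
  selector:
    istio: ingressgateway
  servers:
    - port:
        number: 443
        name: https
        protocol: HTTPS
      tls:
        mode: SIMPLE                   # TLS terminates here; mesh-wide mTLS protects traffic beyond the gateway
        credentialName: tenant-a-cert
      hosts:
        - tenant-a.example.com

Creating one such Certificate and Gateway pair per tenant is what makes the approach scale: onboarding a new tenant only requires adding another pair of manifests.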

The Impact

The Aspen Mesh solution lets Delphi use Let’s Encrypt seamlessly with Istio. This removes the need to build security into each application and instead places it in a highly scalable infrastructure solution. Leveraging the power of Kubernetes, Istio and Aspen Mesh, the Delphi team is delivering a highly secure platform to their customers without the need to implement encryption in each service. 

 




The 451 Take on cloud-native: It's truly transformative for enterprise IT (451 Research)


MARCH 15 2019 

By Jay Lyman, Fernando Montenegro, Matt Aslett, Owen Rogers, Melanie Posey, Brian Partridge, William Fellows, Simon Robinson, Mike Fratto, Liam Rogers  

Helping to shape the modern software development and IT operations paradigms, cloud-native represents a significant shift in enterprise IT. In this report, we define cloud-native and offer some perspective on why it matters and what it means for the industry. 

In this report, 451 Research presents our definition of cloud-native and the key technologies and methodologies that are representative of the trend, including containers, Kubernetes, service mesh and serverless. We recognize the importance of cloud-native based on our survey research and conversations with enterprise providers and end users. Containers and serverless are among the top IaaS features in use and planned for use, according to our Voice of the Enterprise: Digital Pulse, Budgets & Outlook, 2019 survey.  

Cloud-native technologies and methodologies – a departure from monolithic applications and waterfall release processes – are being driven by a desire for speed, efficiency, and support for applications and services that are distributed across hybrid infrastructure such as public clouds, private clouds and on-premises environments. There are, nevertheless, significant challenges with cloud-native approaches, mainly around complexity and lack of available skills and experience. Indeed, access to talent is becoming a key constraint for enterprises transforming around cloud and cloud-native (see Figure 1 below). 

We expect the cloud-native trend to continue to grow, fueled in part by intersections with adjacent technologies and trends, including data and analytics, AI and ML, security, and IoT/edge computing – all of which play a role in facilitating digital transformation. We also expect the cloud-native market, populated by a burgeoning number of startups, as well as established giants, to undergo consolidation as vendors seek to gain talent and the market matures. 

 

The 451 Take

Just like DevOps, cloud-native technologies and methodologies are now being attached to digital transformation efforts, and are expanding their presence in enterprise IT. Right now, the cloud-native trend consists mainly of containers, microservices, Kubernetes, service mesh and serverless, but we may see intersections of these different approaches with adjacent trends, as well as new ones. In addition to application development and deployment in the public cloud, cloud-native is connected to private and hybrid clouds and the ability to run applications consistently across different IT environments. Kubernetes, for example, is not only container management and orchestration software; it is also a distributed application framework – one that is timed well with enterprise use of hybrid environments that span multiple clouds, as well as on-premises infrastructure. 

Cloud-native software is also closely intertwined with open source. Nearly all of the key software components are open source projects, and we believe open source to be table stakes for cloud-native software. There is still ample commercial opportunity around cloud-native. We would also highlight that cloud-native is not limited to public cloud platforms, with on-premises environments increasingly serving as the basis for cloud-native approaches. We also see cloud-native crossing over with adjacent trends. Such intersections and integrations bode well for continued growth and significance of cloud-native approaches. It remains to be seen which approach within the cloud-native arena will be most effective and which combination of different technologies paves the best path forward for enterprise and service-provider organizations, but we will continue to track the technology, use cases and impact of cloud-native going forward, including survey data, market sizing and other research. 

Figure 1: Cloud Skills Gaps – Roadblock to Optimized Cloud Leverage 

Source: 451 Research, Voice of the Enterprise: Cloud, Hosting & Managed Services, Organizational Dynamics 2018 


Cloud-native defined

451 Research defines cloud-native software as: applications designed from the ground up to take advantage of cloud computing architectures and automated environments, and to leverage API-driven provisioning, auto-scaling and other operational functions. Cloud-native architecture and software include applications that have been redesigned to take advantage of cloud computing architectures, but are not limited to cloud applications – we see cloud-native technologies and practices present in on-premises environments in the enterprise. We can also define cloud-native by the technologies and approaches that characterize the trend, all intended to make software development and deployment more fluid and composable – containers, microservices, Kubernetes, service mesh and serverless. 

Our research and conversations indicate that these different types of cloud-native application development and deployment are by no means exclusive in enterprise organizations, which are typically leveraging multiple cloud-native technologies and methodologies across their many different releases and teams. Rather than competing components, tools and methods, the different technologies of cloud-native software are similar to hybrid cloud, which is representative of a best-tool-for-the-job or Best Execution Venue (BEV) approach. We also contend that cloud-native is far broader than application development and deployment. Cloud-native also includes application and infrastructure architecture and organizational approach. 

From an economic point of view, cloud-native technologies enable the true value of cloud by allowing applications to scale and evolve in much shorter timelines than previously. This scalability creates new opportunities for the business in terms of revenue growth, efficiency improvements or a better customer experience. However, cloud entropy means that scalability leads to greater complexity, which is where the likes of Kubernetes, Istio, Prometheus and others come into play. The raison d’être for these open source components is to keep track of the fluid and complex deployments of cloud-native services.  

In terms of applications, we see cloud-native methodologies and technologies used for a breadth of both internal and consumer-facing applications, led by data services and analytics applications, IT optimization and automation, digital user enhancement, and industry-specific software. 

The Spectrum of Abstraction

Contrary to the narrative that ‘serverless is killing containers,’ we don’t see the different approaches and technologies within cloud-native technology competing with or eliminating one another. The same way that containers are living alongside, and sometimes inside of, VMs is indicative of how all of the different aspects of cloud-native will coexist in a mixed-use market. No, serverless is not killing containers; serverless is built on containers. The main distinction between the two is the level of abstraction provided to the end user. Thus, we can also describe cloud-native as a set of technologies that fall somewhere on what we call the Spectrum of Abstraction.  

Figure 2: The Spectrum of Abstraction 

Source: 451 Research, LLC 


On one side of this spectrum is the DIY containers approach, whereby organizations leverage custom code and services and make their own choices on languages, frameworks and APIs. This approach is attractive for certain applications that require low latency, that run longer compute jobs, and for which high traffic can be predicted. On the other end of the spectrum, as functionality becomes more abstracted and invisible, are serverless functions and events, for which there are standardized and opinionated choices that are abstracted away from the end user. In between these two ends of the spectrum are still other levels of abstraction, such as supported Kubernetes distributions and container-as-a-service offerings from the large public cloud vendors and others. 

We typically see these different cloud-native technologies adopted in a specific order, starting with containers used for microservices, which break applications into smaller, loosely coupled services; then Kubernetes container orchestration and distributed application management for container clusters; followed by service mesh to abstract for developers and serverless to abstract for IT operators. However, we do see mixed use of the different approaches, and a leap whereby interested customers can skip ahead is feasible. For example, overheard at Kubecon/CloudNativeCon 2018 was the idea that organizations might be able to skip containers, microservices and the complexity of Kubernetes by simply adopting serverless. The reality is not that simple for most enterprise and service-provider organizations, which are more likely to be using the different technologies concurrently. 

There is some interesting tension between different approaches that are still playing out in the marketplace – for example, advocates of ‘single platform’ approaches to cloud-native, such as OpenStack, Pivotal/Cloud Foundry or Red Hat, versus loosely coupled models that will be composed of different coordinated parts. Both require a specific organizational model, and the success or otherwise of each has yet to be determined – enterprises are still undergoing transformation. 

Cloud-native isn't only in the cloud

Cloud-native does not necessarily mean applications run only on private or public cloud infrastructure. The hybrid cloud trend, which entails the use of a mix of public and private clouds with on-premises environments, dictates that enterprises will seek to run cloud-native applications atop on-premises infrastructure, as well. Vendors have responded aggressively with offerings such as Azure Stack, GKE On-Prem and AWS Outposts. PaaS vendors, such as Red Hat with OpenShift or Pivotal with PCF, have also focused on the ability to run applications consistently across public clouds and on-premises infrastructure. In fact, our recent Voice of the Enterprise: Servers and Converged Infrastructure, Vendor Evaluations 2018 survey indicates continued growth of x86 servers and on-premises environments, with nearly one-third of organizations anticipating an increase in their x86 server deployments in the coming year. 

Further evidence of the ties between cloud-native and hybrid cloud can be found in our Voice of the Enterprise: Cloud, Hosting and Managed Services, Workloads and Key Projects 2018 survey, which indicates that most cloud-native software (32%) is designed to run effectively on any cloud environment, with another 22% designed to run effectively on any public cloud environment, rather than for a specific public cloud (30%) or private cloud (17%). 

Cloud-native with adjacent trends/sectors

Data, AI and ML 

The dynamism that cloud-native architecture and containers provide is ideal for stateless web applications, but it can be problematic for stateful database workloads, given the need for a persistent connection between the application and its associated data volume. Kubernetes, in particular, has been at the forefront of containerization of stateful services, providing elements for persistence and cluster lifecycle management that enable custom deployments for individual databases that could be the beginning of a viable long-term approach. Database vendors are beginning to update their products to take advantage of these features. However, inherent challenges remain in getting databases and containers to work together, and vendors, enterprises and industry consortia must work together to continue to evolve Kubernetes, in order to provide a general-purpose environment for the containerization of multiple stateful services. 

Cloud-native methodology and software are also crossing over with artificial intelligence and machine learning, including integrations of TensorFlow, an open source machine learning library, and projects such as Kubeflow for machine learning on Kubernetes. The combination enables data scientists to create and train models in self-contained environments with the necessary data and dependencies; these can then be deployed into production via Kubernetes, which provides autoscaling, failover, and infrastructure monitoring and management, as well as execution venue abstraction. 

Security 

Increased adoption of cloud-native technology and delivery patterns will deeply influence how organizations think about security, even as key security principles, such as the need to maintain confidentiality, integrity and availability, remain. The scope of changes will affect both security technology and practices. On the technology front, the key cloud-native technologies (containers, Kubernetes, service mesh and others) have incorporated some security functionality themselves – service mesh supports workload identity and encryption, while Kubernetes includes several policy constructs. This will affect organizations deploying these technologies, as well as vendor offerings, since that functionality becomes the reference point for additional functionality and design decisions. Particularly as organizations adopt high-level services and abstractions (containers as a service and serverless), the focus of security shifts much more to application-level security and data security. This is a shift away from traditional infrastructure security considerations. Lastly, the quickened pace associated with cloud-native deployments will deeply affect security teams – not only will they need to skill up in cloud-native technologies and patterns, but the very pace of deployment will require teams to rethink how they interact with the rest of IT, and what role security can actually play. 

IoT and edge computing 

While the timing of their arrival on the IT scene was coincidental, it’s as though containerization and IoT were born to be together as the match between capability (containers) and need (IoT app developers). The trends are well-aligned as the IoT industry matures, scales, and requires a complicated tapestry of computing venues depending on context and use case. 

We believe the successful future of IoT is linked with timely adoption of cloud-native techniques to support the speed and diversity of IoT apps. The reality that a nontrivial portion of IoT apps will actually fail means reducing the cost of doing so is a high priority, and there is a need for iterative updates to software based on feedback from ‘the field.’ There is also a requirement for a small operating system footprint for low-power edge devices; support for microservices to enable the data- and messaging-intensive characteristics of IoT across and within multiple actors; and platform-independent runtime support using container technologies and orchestration to ensure that workloads are run on the optimal computing platform at the edge, near edge or centralized core. 

Networking 

There are still significant challenges to cloud-native networking, whether in a cloud service, an on-premises or colocation cloud environment, a virtual machine-based cloud, a container-based cloud, or a mix of services and on-premises. Enterprise IT prefers consistency in capabilities, but cloud-native environments have only basic networking capabilities, which established networking vendors have been attempting to address by integrating their switch and management software with the container environment and the container management framework. These products unify networking workflows and are familiar to IT, but can also inhibit IT from moving past its traditionally managed infrastructure, which is rigid and slow to adapt to changes. Layer a service mesh on top, which offers a more robust technology for cloud-native infrastructure and provides a useful abstraction between application connectivity and the physical or virtual paths interconnecting software and hardware, and much of the intelligent networking capability in the physical underlay becomes irrelevant at the application layer. 

There are opportunities for application delivery controller (ADC) vendors that can deeply embed themselves into enterprise IT by offering to offload a number of critical capabilities from application owners, such as intelligent load-balancing, high availability and security functions, to purpose-built platforms that can augment applications and keep developers focused on building features versus infrastructure. ADC vendors are also finding ways to embed their products into application infrastructure by enabling scale-out architecture via robust APIs and replacing container environment components, such as the ingress controller in front of a container pod. 

Storage 

There is a shift in how storage is being run as both startups and established vendors offer more storage capabilities (ranging from the storage controller to the backup application) in containers. The alternative is to have them run in VMs, as one would find in HCI-style deployments, or on a dedicated operating system like in proprietary appliances. This brings new flexibility to storage management since the various capabilities of storage platforms can be orchestrated and automated using the same tooling as the applications they are supporting. 

Another consideration in the storage industry is providing containerized applications with storage as vendors evolve their offerings to take into account Docker volume drivers and Kubernetes Container Storage Interface drivers to support flexible storage consumption for containerized, stateful applications. This will be increasingly important as containers are used for stateful applications, whether they are net new or traditional and legacy apps that are being containerized for use in the cloud. 

Heavily open source

Considering the most successful software components of cloud-native, open source software is a critical part of the trend. Nearly all cloud-native software components are open source, including Docker containers, Kubernetes management and orchestration, Helm package management, Prometheus monitoring, Istio service mesh, and Knative serverless. It is also noteworthy in the context of cloud-native that modern open source software projects and communities include not only vendors, but also end users, which are among project supporters and sponsors in the cloud-native market. The open source nature of cloud-native also means that traditional rivals, such as Microsoft and Google or Pivotal and Red Hat, work together on many of these open source projects in the cloud-native ecosystem. Cloud-native is also all about collaboration, meaning it must accommodate DevOps by offering something for developers and IT operators, as well as other stakeholders, including security teams, data analytics and data science teams, and line-of-business leaders. 

Cloud-native competition and outlook

The industry is moving toward containers, microservices, Kubernetes, serverless and other cloud-native constructs. While there are other flavors available, Kubernetes has the wind in its sails and has all but won the battle for container orchestration. Many cloud-native entrants have a ‘Kubernetes first’ posture in terms of platform architecture and service delivery. Incumbent vendors, service providers and integrators are rewriting and retooling for cloud-native. Cloud-native is a part of every conversation with customers. Most enterprises are already working at some level with cloud-native constructs and exploring what new outcomes can be achieved. Every company is becoming a service provider – seeking to better engage with customers, partners, and suppliers with new digital services and experiences, and to compete in the digital economy. Companies will need to raise their software IQ, and cloud-native will be the basis of this, supported on the cloud operating and delivery model. Cloud-native practices such as CI/CD enable companies to access speed and agility not previously available, and will require new organizational approaches to development. 

With many vendors across the different subsegments (containers, Kubernetes, service mesh and serverless), we expect further consolidation of the market. The need for cloud-native talent and expertise – our VotE survey data indicates cloud functions/tools such as containers and microservices are among the most acute skills shortages – will also likely drive mergers and acquisitions in the space. However, it may take some time since different enterprise and service-provider customers have very different needs, and thus support a broad array of providers in the market. The cloud-native market is highly competitive, with no dominant player yet established, although the hyperscale public cloud providers and large vendors that embraced containers early on are the clear leaders. 

We also expect that, driven largely by digital transformation and the need to embrace and leverage new technology, cloud-native approaches will more deeply permeate large enterprise organizations. Similar to the DevOps trend, this means increasingly pulling in additional stakeholders, including administrators and line-of-business leaders. This means cloud-native technology and methodology will probably follow the pattern of agile and DevOps to reach half or more of organizations within the next few years. It is also important to note that the concept of cloud-native was meant to mean more than containers, Kubernetes or serverless, leaving room for the next technology, which may be a combination of existing ones; integration with adjacent trends, such as DevSecOps, data analytics, AI and ML; or something currently unknown. 

 




Service mesh update: Maintainers add features while practitioners push federation (451 Research)


Analysts – Jean Atelsek, William Fellows 

Publication date: Wednesday, December 4 2019 

Introduction

Cloud adopters are enthusiastic about the promise of service mesh to consistently apply routing, policy and encryption across microservices-based applications, but implementation has been difficult due to fiddly configuration and management demands. Add to this competing control plane options – Istio, Consul, Kuma, Linkerd, NSX and AWS’s proprietary App Mesh – at various stages of adoption and maturity, and you get a perfect storm of confusion; dare we say a bit of a ‘service mess.’ This is to be expected at the current stage of market development. It’s a market that is being made up as we go – it is thrashing, crowded and complex. There’s lots of confusion; clean, simple stories will be successful here. At KubeCon 2019 in San Diego, maintainers introduced tools to make their offerings easier to love, while practitioners cited the need for an open standard that can federate various preferences across environments. 

The 451 Take

Service mesh was a prominent topic at this year’s KubeCon North America, complete with its own Day Zero event (ServiceMeshCon), a CNCF roundtable and a raft of announcements from project maintainers. In a show of hands, about 10% of attendees to the sold-out ServiceMeshCon said they had experience with service mesh in production – about 50% had tried it out. The landscape seems to be branching out in several directions, with open source projects adding tools to ease adoption of their control planes, vendors hoping to capitalize on service mesh difficulty by offering to run it as a service on behalf of enterprises, and other participants promising to reconcile the various offerings with the help of an overarching specification. While few dispute the need for a way to route, monitor and authenticate traffic for service-to-service communications, the way forward for most organizations is far from clear, indicating opportunity as well as risk, although the industry has converged on sidecar proxies (primarily Envoy) as the best available choice for the data plane. 

 

Context

Use of a service mesh is important to successful microservices implementations. Data from 451 Research’s Voice of the Enterprise: DevOps, 2H 2019 survey finds that 13.9% of enterprises are now in production with service mesh, 18.6% have some adoption and about 44% are in planning. 

Please indicate your organization’s adoption status for service mesh 


Source: 451 Research’s Voice of the Enterprise: DevOps, 2H 2019 

Lessons learned

It’s telling that most of the service mesh practitioners speaking at KubeCon were from large, technically sophisticated cloud-native organizations such as Lyft, Uber and Pinterest. Although many vendors are pursuing the opportunity to bridge the world of highly scalable cloud-native environments with on-premises data and legacy applications – a mesh is, after all, only as strong as its weakest link – advice gathered from organizations that have implemented service meshes at scale is instructive. 

  • Collaborate with stakeholders starting early in the process. Tech talks, prototyping and encouraging opt-in by service owners who have the most to gain (e.g., supporting a new language or functionality) will help get the implementation off on the right foot. Proactively identify those most likely to be affected. 
  • Start with an ingress solution. Establish a consistent way for external applications to call into the mesh. Vendors that can deliver an ‘easy button’ on-boarding run book for customers seeking to get started with service mesh will find beginning with ingress to be a useful first step. 
  • Prioritize security for services and for the service mesh itself. A primary use case for service mesh is to ensure mTLS encryption of service-to-service traffic; application and sidecar communications need to be rock solid. With so many layers of software-defined interaction, bugs can arise from many sources. Have a systematic way of testing for and finding the source of problems. 
  • Be careful with migration. Service mesh involves a big change in how services communicate with each other. Planning ahead requires service discovery, service registration and security infrastructure to be in place. 
  • Disable unused components. Some service mesh features can cause problems during implementation even if they’re not active; use the simplest set of tools that can address the problem you’re trying to solve. 
  • Never stop investing in performance improvements. The main downsides of service mesh are latency (multiplied by the number of hops in an application) and resource consumption (multiplied by the number of sidecars); batch chatty connections when possible. 
  • Roll out slowly, start small, scale up. Begin with a use case that’s not in the critical path. As problems are ironed out of initial deployments, iterate quickly and scale to other applications/teams. Doing service mesh for one application or team may mean you can end up with a pet, not cattle. 
  • Plan an update process. Roll out updates slowly, qualifying new releases with critical users first. Allow users to do self-service rollbacks to a specific supported version, and keep track of how many users rolled back a given version to point to widespread difficulties. Fix user issues as soon as possible and ensure that rollbacks are temporary. 
  • Be especially careful with newly opened connections. This is where errors are most likely to be introduced. 
  • Keep the faith. Despite the difficulties, service mesh adopters say the benefits make the difficulties worthwhile. 

 

Incremental improvements

Some vendors expect the industry to settle on a single standard, as it did with Kubernetes for container orchestration, and Istio has the pole position as a Google-driven project that plays well with Kubernetes. Google’s decision to keep Istio under its own control for now (rather than donating it to the CNCF under an open governance model) worries some potential customers and makes it a nonstarter for others, but many players (including Tetrate, Aspen Mesh, VMware with NSX Service Mesh and IBM with App Connect) are investing in Istio as a foundation for enterprise-grade managed services to support heterogeneous environments. 

Others expect there to remain a variety of service meshes to address a variety of use cases. Given that businesses are already using a variety of control planes in production, Microsoft introduced the Service Mesh Interface (SMI) project in May, a specification for interoperability across different mesh technologies, including Istio, Linkerd and Consul Connect. The project was launched in partnership with Buoyant, Hashicorp, Solo.io, Kinvolk and Weaveworks, with support from Aspen Mesh, Canonical, Docker, Pivotal, Rancher, Red Hat and VMware. The goal of SMI is to use developer-friendly APIs to lower the barrier to entry and the risk of using a service mesh, to collaborate with the service mesh community on customer requests, and to create a consistent experience across a new ecosystem with an interoperable, extensible framework. Microsoft provided a demo at ServiceMeshCon, but it won’t see the light of day until 2020. 

Among the new projects and features introduced by service mesh maintainers at KubeCon: 

  • Solo.io, which does not have its own mesh but offers a Service Mesh Hub dashboard that installs, discovers, manages and groups diverse meshes (including AWS’s App Mesh) into one big mesh, announced AutoPilot, an operator framework for building workflows on top of service mesh. AutoPilot will help Kubernetes operators enable mesh metrics and APIs, automated mesh configuration, the ability to expose and invoke webhooks, and out-of-the-box GitOps workflows. The plan is to use telemetry within Kubernetes clusters to drive the behavior of the service mesh, for what Solo.io calls ‘adaptive service mesh.’ 
  • Buoyant, maker of Linkerd (one of the few service meshes that doesn’t use the Envoy proxy as a data plane) introduced Dive, a team collaboration tool that captures microservice deployments as events and compiles ownership information and dependencies into a service catalog – ‘like a Facebook for microservices.’ Dive is free and in private beta; there is currently a waitlist for the beta. 
  • Network Service Mesh, a CNCF sandbox project announced in 2018, has attracted 40 contributors and is reportedly receiving interest from financial companies, enterprises and service providers. The project is designed to manage complicated layer 2 and layer 3 use cases in Kubernetes so app service meshes can focus on layer 7 connectivity. 
  • VMware’s NSX Service Mesh is a SaaS offering that runs in public clouds. Based on Istio and Envoy, NSX Service Mesh expands observability and policies to users, data and services, in addition to federation between service mesh clusters. It provides the ability for SecOps and DevOps integrations through policies and tools that allow them to set up application SLOs, access control, encryption and context-based security policies. NSX Service Mesh is built on a global control plane with the agents running on any Kubernetes cluster on any cloud. VMware sees key use cases including application mobility and migration, service mesh HA, E2E encryption for compliance, and visibility for Dev/SecOps. 



Sailing Faster with Istio

While the extraordinarily large container ship Ever Given ran aground in the Suez Canal, halting a major trade route and causing losses in the billions, our solution engineers at Aspen Mesh have been stuck diagnosing a tricky Istio and Envoy performance bottleneck on their own island for the past few weeks. Though the scale and global impact of these two problems are quite different, it presented an interesting way to correlate a global shipping event with the metaphorical nautical themes used by Istio. To elaborate on this theme, let’s switch from containers carrying dairy, and apparently everything else under the sun, to containers shuttling network packets.

To unlock the most from containers and microservices architecture, Istio (and Aspen Mesh) uses a sidecar proxy model. Adding sidecar proxies into your mesh provides a host of benefits, from uniform identity to security to metrics and advanced traffic routing. As Aspen Mesh customers range from large enterprises all the way to service providers, the performance impact of adding these sidecars is as important to us as the benefits outlined above. The performance experiment that I’m going to cover in this blog is geared toward evaluating the impact of adding sidecar proxies in high-throughput scenarios, with the proxy on the server side, the client side, or both.

We have encountered workloads, especially in the service provider space, where there are high requests or transactions-per-second requirements for a particular service. Also, scaling up — i.e., adding more CPU/memory — is preferable to scaling out. We wanted to test the limits of sidecar proxies with regards to the maximum achievable throughput so that we can tune and optimize our model to meet the performance requirements of the wide variety of workloads used by our customers.

Throughput Test Setup

The test setup we used for this experiment was rather simple: a Fortio client and server running on Kubernetes on large AWS node instance types like burstable t3.2xlarge with 8 vCPUs and 32 GB of memory or dedicated m5.8xlarge instance types which have 32 vCPUs and 128 GB of memory. The test was running a single instance of the Fortio client and server pod with no resource constraints on their own dedicated nodes. The Fortio client was run in a mode to maximize throughput like this:

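The exact invocation isn't shown here; a representative Fortio command under these assumptions (the service name and connection count are placeholders) looks like this:

# -qps 0 removes the rate limit, -t 60s runs for one minute, -c sets the number of simultaneous connections
fortio load -qps 0 -t 60s -c 64 http://fortio-server.default.svc.cluster.local:8080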
The above command runs the test for 60 seconds with queries per second (QPS) 0 (i.e. maximum throughput with a varying number of simultaneous parallel connections). With this setup on a t3.2xlarge machine, we were able to achieve around 100,000 QPS. Further increasing the number of parallel connections didn’t result in throughput beyond ~100K QPS, signaling a possible CPU bottleneck. Running the same experiment on an m5.8xlarge instance, we could achieve much higher throughput around 300,000 QPS or higher depending upon the parallel connection settings.

This was sufficient proof of CPU throttling. As adding more CPUs increased the QPS, we felt that we had a reasonable baseline to start evaluating the effects of adding sidecar proxies in this setup.

Adding Sidecar Proxies on Both Ends

Next, with the same setup on t3.2xlarge instances, we added Istio sidecar proxies on both the Fortio client and server pods with Aspen Mesh default settings: mTLS set to STRICT, access logging enabled and the default concurrency (worker threads) of 2. With these parameters, and running the same command as before, we could only get a maximum throughput of around ~10,000 QPS.

This is a factor of 10 reduction in throughput. This was expected as we had only configured two worker threads, which were hopefully running at their maximum capacity but could not keep up with client load.

So, the logical next step for us was to increase the concurrency setting to run more worker threads to accept more connections and achieve higher throughput. In Istio and Aspen Mesh, you can set the proxy concurrency globally via the concurrency setting in the proxy config under mesh config, or override it per workload via pod annotations like this:

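For illustration only (the values are examples, not our tuned settings), the global default lives under meshConfig.defaultConfig, and the per-workload override uses the proxy.istio.io/config annotation on the pod template:

# Global default for all sidecars (set in the mesh config override file)
meshConfig:
  defaultConfig:
    concurrency: 4

# Per-workload override via a pod-template annotation
metadata:
  annotations:
    proxy.istio.io/config: |
      concurrency: 4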
Note that using the value “0” for concurrency configures it to use all the available cores on the machine. We increased the concurrency setting from two to four to six and saw a steady increase in maximum throughput from 10K QPS to ~15K QPS to ~20K QPS as expected. However, these numbers were still quite low (by a factor of five) as compared to the results with no sidecar proxies.

To eliminate the CPU throttling factor, we ran the same experiment on m5.8xlarge instances with even higher concurrency settings but the maximum throughput we could achieve was still around ~20,000 QPS.

This degradation was far from acceptable, so we dug into why the throughput was low even with sufficient worker threads configured on the sidecar proxies.

Peeling the Onion

To investigate this issue, we looked at the CPU utilization metrics in the server pod and noticed that the CPU utilization as a percentage of total requested CPUs was not very high. This seemed odd as we expected the proxy worker threads to be spinning as fast as possible to achieve the maximum throughput, so we needed to investigate further to understand the root cause.

To get a better understanding of low CPU utilization, we inspected the connections received by the server sidecar proxy. Envoy’s concurrency model relies on the kernel to distribute connections between the different worker threads listening on the same socket. This means that if the number of connections received at the server sidecar proxy is less than the number of worker threads, you can never fully use all CPUs.

As this investigation was purely on the server-side, we ran the above experiment again with the Fortio client pod, but this time without the sidecar proxy injected and only the Fortio server pod with the proxy injected. We found that the maximum throughput was still limited to around ~20K QPS as before, thereby hinting at issues on the server sidecar proxy.

To investigate further, we had to look at connection level metrics reported by Envoy proxy. Later in this article, we’ll see what happens to this experiment with Envoy metrics exposed. (By default, Istio and Aspen Mesh don’t expose the connection-level metrics from Envoy.)

These metrics can be enabled in Istio version 1.8 and above by following this guide and adding the appropriate pod annotations corresponding to the metrics you want to be exposed. Envoy has many low-level metrics emitted at high resolution that can easily overwhelm your metrics backend for a moderately sized cluster, so you should enable this cautiously in production environments.

Additionally, it can be quite a journey to find the right Envoy metrics to enable, so here’s what you will need to get connection-level metrics. On the server-side pod, add the following annotation:

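As an example of the kind of annotation we mean (the exact regex is our assumption; adjust it to the stats you care about), the proxy config on the server pod can include a proxyStatsMatcher that matches only the downstream connection counters:

metadata:
  annotations:
    proxy.istio.io/config: |
      proxyStatsMatcher:
        inclusionRegexps:
        - "listener\\..*downstream_cx_(total|active)"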
This enables reporting for all listeners configured by Istio, which can be a lot depending upon the number of services in your cluster, but it exposes only the downstream connections total counter and downstream connections active gauge metrics.

To look at these metrics, you can use your Prometheus dashboard, if it’s enabled, or port-forward to the server pod under test to port 15000 and navigate to http://localhost:15000/stats/prometheus. As there are many listeners configured by Istio, it can be tricky to find the correct one. Here’s a quick primer on how Istio sets up Envoy configuration. (You can find the complete list of Envoy listener metrics here.)

For any inbound connections to a pod from clients outside of the pod, Istio configures a virtual inbound listener at 0.0.0.0:15006, which receives all the traffic from iptables’ redirect rules. This is the only listener that’s actually configured to receive connections from the kernel, and after the connection is received, it is matched against filter chain attributes to proxy the traffic to the correct application port on localhost. This means that even though the Fortio client above is targeting port 8080, we need to look at the total and active connections for the virtual inbound listener at 0.0.0.0:15006 instead of 0.0.0.0:8080. Looking at this metric, we found that the number of active connections were close to the configured number of simultaneous connections on the Fortio client side. This invalidated our theory about the number of connections being less than worker threads.
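As a concrete way to pull these numbers (the deployment name is a placeholder), you can port-forward to the Envoy admin port and filter for the virtual inbound listener:

kubectl port-forward deploy/fortio-server 15000:15000 &
curl -s http://localhost:15000/stats/prometheus | grep downstream_cx | grep 15006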

The next step in our debugging journey was to look at the number of connections received on each worker thread. As I had alluded to earlier, Envoy relies on the kernel to distribute the accepted connections to different worker threads, and for all the worker threads to be fully utilizing the allotted CPUs, the connections also need to be fairly balanced. Luckily, Envoy has per-worker metrics for listeners that can be enabled to understand the distribution. Since these metrics are rooted at listener.<address>.<handler>.<metric name>, the regex provided in the annotation above should also expose these metrics. The per-worker metrics looked like this:

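The original screenshot isn't reproduced here, but the per-worker stats follow the naming pattern below; the values are only illustrative of the skew described next, not an exact capture:

listener.0.0.0.0_15006.worker_0.downstream_cx_active: 1210
listener.0.0.0.0_15006.worker_1.downstream_cx_active: 1498
listener.0.0.0.0_15006.worker_2.downstream_cx_active: 870
...
listener.0.0.0.0_15006.worker_10.downstream_cx_active: 11500
...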
As you can see from the metrics above, the connections were far from being evenly distributed among the worker threads. One thread, worker 10, had 11.5K active connections, compared to some threads that had around ~1-1.5K active connections, and others that had even fewer. This explains the low CPU utilization numbers, as most of the worker threads just didn’t have enough connections to do useful work.

In our Envoy research, we quickly stumbled upon this issue, which very nicely sums up the problem and the various efforts that have been made to fix it.


So, next, we went looking for a solution to fix this problem. It seemed like, for the moment, our own Ever Given was stuck as some diligent worker threads struggled to find balance. We needed an excavator to start digging.

While our intrepid team tackled the problem of scaling for high-throughput workloads by adding sidecar proxies, we encountered a bottleneck not entirely unlike what the Ever Given experienced not long ago in the Suez Canal.

Luckily, we had a few more things to try, and we were ready to take a closer look at the listener metrics.

Let There Be Equality Among Threads!

After parsing through the conversations in the issue, we found the pull request that enabled a configuration option to turn on a feature to achieve better balancing across worker threads. At this point, trying this out seemed worthwhile, so we looked at how to enable this in Istio. (Note that as part of this PR, the per-worker thread metrics were added, which was useful in diagnosing this problem.)

For all the ignoble things EnvoyFilter can do in Istio, it’s useful in situations like these to quickly try out new Envoy configuration knobs without making code changes in “istiod” or the control plane. To turn the “exact balance” feature on, we created an EnvoyFilter resource like this:

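A minimal sketch of such an EnvoyFilter, assuming the Fortio server pod carries an app: fortio-server label and that we target the virtual inbound listener on port 15006, looks like this:

apiVersion: networking.istio.io/v1alpha3
kind: EnvoyFilter
metadata:
  name: fortio-server-exact-balance    # hypothetical name
  namespace: default                   # namespace of the server workload (assumption)
spec:
  workloadSelector:
    labels:
      app: fortio-server               # hypothetical label on the Fortio server pod
  configPatches:
  - applyTo: LISTENER
    match:
      context: SIDECAR_INBOUND
      listener:
        portNumber: 15006              # virtual inbound listener that accepts connections from the kernel
    patch:
      operation: MERGE
      value:
        connection_balance_config:
          exact_balance: {}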
With this configuration applied and with bated breath, we ran the experiment again and looked at the per-worker thread metrics. Voila! The connections were now almost perfectly balanced across the worker threads.

Measuring the throughput with this configuration set, we could achieve around ~80,000 QPS, which is a significant improvement over the earlier results. Looking at CPU utilization, we saw that all the CPUs were fully pegged at or near 100%. This meant that we were finally seeing the CPU throttling. At this point, by adding more CPUs and a bigger machine, we could achieve much higher numbers as expected. So far so good.

As you may recall, this experiment was purely to test the effects of server sidecar proxy, so we removed the client sidecar proxy for these tests. It was now time to measure performance with both sidecars added.

Measuring the Impacts of a Client Sidecar Proxy

With this exact balancing configuration enabled on the inbound port (server side only), we ran the experiment with sidecars on both ends. We were hoping to achieve high throughput that would only be limited by the number of CPUs dedicated to Envoy worker threads. If only things were that simple.

We found that the maximum throughput was once again capped at around ~20K QPS.

A bit disappointing, but since we then knew about the issue of connection imbalance on the server side, we reasoned that the same could happen on the client side between the application and the sidecar proxy container on localhost. First, we enabled the following metrics on the client-side proxy:

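Again as an illustration (the regexes are our assumption of a reasonable filter), the client pod's annotation extends the earlier one with cluster-level connection counters:

metadata:
  annotations:
    proxy.istio.io/config: |
      proxyStatsMatcher:
        inclusionRegexps:
        - "listener\\..*downstream_cx_(total|active)"
        - "cluster\\..*upstream_cx_(total|active)"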
In addition to the listener metrics, we also enabled cluster-level metrics, which emit total and active connections for any upstream cluster. We wanted to verify that the client sidecar proxy was sending a sufficient number of connections to the upstream Fortio server cluster to keep the server worker threads occupied. We found that the number of active connections mirrored the number of connections used by the Fortio client in our command. This was a good sign. Note that Envoy doesn’t report cluster-level metrics at the per-worker level, but these are all aggregated, so there’s no way for us to know how the connections were distributed on the outbound side.

Next, we inspected the listener connection statistics on the client side, similar to the server side, to ensure that we were not having connection imbalance issues. The outbound listeners, or the listeners set up to handle traffic originating from the application in the same pod as the sidecar proxy, are set up a bit differently in Istio as compared to the inbound side. For outbound traffic, a virtual listener “0.0.0.0:15001” is created, similar to the listener on “0.0.0.0:15006,” which is the target for iptables redirect rules. Unlike the inbound side, the virtual listener hands off the connection to the more specific listener, like “0.0.0.0:8080,” based on the original destination address. If there are no specific matches, then the listener configuration in the virtual outbound takes effect; this can block or allow all traffic depending on your configured outbound traffic policy. In the traffic flow from the Fortio client to the server, we expected the listener at “0.0.0.0:8080” to be handling connections on the client-side proxy, so we inspected the connection metrics at this listener.

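If you want to see this listener layout for yourself (the pod name is a placeholder), istioctl can dump the client proxy's listener configuration:

istioctl proxy-config listeners <fortio-client-pod>
istioctl proxy-config listeners <fortio-client-pod> --port 15001 -o json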
These listener metrics showed the same connection imbalance between worker threads that we had seen on the server side. In fact, the connections on the outbound client-side proxy were being handled by only one worker thread, which explains the poor throughput QPS numbers. Having fixed this on the server side, we applied a similar EnvoyFilter configuration, with minor tweaks for context and port, to address this imbalance:

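The sketch below shows the kind of tweak we mean, switching the context to outbound and the port to the virtual outbound listener; the resource name and client pod label are hypothetical, and as noted below we also experimented with port 8080:

apiVersion: networking.istio.io/v1alpha3
kind: EnvoyFilter
metadata:
  name: fortio-client-exact-balance    # hypothetical name
  namespace: default                   # namespace of the client workload (assumption)
spec:
  workloadSelector:
    labels:
      app: fortio-client               # hypothetical label on the Fortio client pod
  configPatches:
  - applyTo: LISTENER
    match:
      context: SIDECAR_OUTBOUND
      listener:
        portNumber: 15001              # virtual outbound listener; we also tried 8080
    patch:
      operation: MERGE
      value:
        connection_balance_config:
          exact_balance: {}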
Surely, applying this resource would fix our issue and we would be able to achieve high QPS with both client and server sidecar proxies with sufficient CPUs allocated to them. Well, we ran the experiment again and saw no difference in the throughput numbers. Checking the listener metrics again, we saw that even with this EnvoyFilter resource applied, only one worker thread was handling all the connections. We also tried applying the exact balance config on both virtual outbound port 15001 and outbound port 8080, but the throughput was still limited to 20K QPS.

This warranted the next round of investigations.

Original Destination Listeners, Exact Balance Issues

We went around looking in Envoy code and opened Github issues to understand why the client-side exact balance configuration was not taking effect, while the server side was working wonders. The key difference between the two listeners, other than the directionality, was that the virtual outbound listener “0.0.0.0:15001” was an original destination listener, which hands over connections to other listeners matched on the original destination address. With help from the Istio community (thanks, Yuchen Dai from Google), we found this open issue, which explains this behavior in a rather cryptic way.

Basically, the current exact balance implementation relies on connection counters per worker thread to fix the imbalance. When original destination is enabled on the virtual outbound listener, the connection counter on the worker thread is incremented when a connection is received, but as the connection is immediately handed off to the more specific listener like “0.0.0.0:8080,” it is decremented again. This quick increase and decrease in the internal count spoofs the exact balancer into thinking the balance is perfect, as all these counters are always at zero. It also appears that applying exact balance on the listener that handles the connection (“0.0.0.0:8080” in this case) but doesn’t accept it from the kernel has no effect, due to current implementation limitations.

Fortunately, the fix for this issue is in progress, and we’ll be working with the community to get this addressed as quickly as possible. In the meantime, if you’re getting hit by these performance issues on the client side, scaling out with a lower concurrency setting is a better approach to reach higher throughput QPS numbers than scaling up with higher concurrency and worker threads. We are also working with the Istio community to provide configuration knobs for enabling exact balance in Envoy to optionally switch default settings so that everyone can benefit from our findings.

Working on this performance analysis was interesting and a challenge in its own way, like the small tractor next to the giant ship trying to make it move.

Well, maybe not exactly, but it was a learning experience for me and my team, and I’m glad we are able to share our learnings with the rest of the community as this aspect of Istio is often overlooked by the broader vendor ecosystem. We will run and publish performance numbers related to the impact of turning on various features such as mTLS, access logging and tracing in high-throughput scenarios in future blogs, so if you’re interested in this topic, subscribe to our blog to get updates or reach out to us with any questions.

Thank you Aspen Mesh team members Pawel and Bart who patiently and diligently ran various test scenarios, collected data and were uncompromising in their pursuit to get the last bit out of Istio and Aspen Mesh. It’s not surprising. After all, being part of F5, taking performance seriously is just part of our DNA. 


Installing Multicluster Aspen Mesh on KOPS Cluster


I recently tried installing Aspen Mesh in a multicluster configuration, and it was easier than I anticipated. In this post, I will walk you through my process. You can read the original version of this process here.

First, ensure that you have two Kubernetes clusters with the same version of Aspen Mesh installed on each (if you need an Aspen Mesh account, you can get a free 30-day trial here). Once you have an account, refer to the documentation for installing Aspen Mesh on your cluster.

kops get cluster

ssah-test1.dev.k8s.local        aws    us-west-2a
ssah-test2.dev.k8s.local        aws    us-west-2a

There are multiple ways to configure Aspen Mesh in a multicluster environment. In the following example, I have installed Aspen Mesh 1.9.1-am1 on both of my clusters, and the installation type is Multi-Primary on different networks.

Pre-requisites for the Setup:

  • API server: the API server of each cluster must be accessible to the other cluster.
  • Trust: trust must be established between all clusters in the mesh. This is achieved by using a common root CA to generate intermediate certs for each cluster.

Configuring Trust:

I am creating an RSA-type certificate for my root cert. After downloading and extracting the Aspen Mesh binary, I create a certs folder and add it to the directory stack:

mkdir -p certs
pushd certs

The downloaded binary should include a tools directory for creating your certificates. Run the make command to create a root-ca folder, which will contain four files: root-ca.conf, root-cert.csr, root-cert.pem and root-key.pem. For each of your clusters, you will also generate an intermediate cert and key for the Istio CA:

make -f ../tools/certs/Makefile.selfsigned.mk root-ca
make -f ../tools/certs/Makefile.selfsigned.mk cluster1-cacerts
make -f ../tools/certs/Makefile.selfsigned.mk cluster2-cacerts

You will then have to create a secret named cacerts in the istio-system namespace of each cluster with the files generated in the last step. These secrets are what establish trust between the clusters, since the same root-cert.pem is used to create each intermediate cert.

kubectl create secret generic cacerts -n istio-system \
  --from-file=cluster1/ca-cert.pem \
  --from-file=cluster1/ca-key.pem \
  --from-file=cluster1/root-cert.pem \
  --from-file=cluster1/cert-chain.pem \
  --context="${CTX_CLUSTER1}"

kubectl create secret generic cacerts -n istio-system \
  --from-file=cluster2/ca-cert.pem \
  --from-file=cluster2/ca-key.pem \
  --from-file=cluster2/root-cert.pem \
  --from-file=cluster2/cert-chain.pem \
  --context="${CTX_CLUSTER2}"

Next, we will move on to the Aspen Mesh configuration, where we enable multicluster for istiod and give names to the network and cluster. Add the following fields to your override file, which will be used during the Helm installation or upgrade, and create a separate file for each cluster. You will also need to label the istio-system namespace in both of your clusters with the appropriate network label.

kubectl --context="${CTX_CLUSTER1}" label namespace istio-system topology.istio.io/network=network1

kubectl --context="${CTX_CLUSTER2}" label namespace istio-system topology.istio.io/network=network2

For Cluster 1

#Cluster 1

#In order to make the application service callable from any cluster, the DNS lookup must succeed in each cluster
#This provides DNS interception for all workloads with a sidecar, allowing Istio to perform DNS lookup on behalf of the application.
meshConfig:
  defaultConfig:
    proxyMetadata:
    # Enable Istio agent to handle DNS requests for known hosts
    # Unknown hosts will automatically be resolved using upstream dns servers in resolv.conf
      ISTIO_META_DNS_CAPTURE: "true"

global:
  meshID: mesh1
  multiCluster:
    # Set to true to connect two kubernetes clusters via their respective
    # ingressgateway services when pods in each cluster cannot directly
    # talk to one another. All clusters should be using Istio mTLS and must
    # have a shared root CA for this model to work.
    enabled: true
    # Should be set to the name of the cluster this installation will run in. This is required for sidecar injection
    # to properly label proxies
    clusterName: "cluster1"
    globalDomainSuffix: "local"
    # Enable envoy filter to translate `globalDomainSuffix` to cluster local suffix for cross cluster communication
    includeEnvoyFilter: false
  network: network1

For Cluster 2

#Cluster 2

#In order to make the application service callable from any cluster, the DNS lookup must succeed in each cluster
#This provides DNS interception for all workloads with a sidecar, allowing Istio to perform DNS lookup on behalf of the application.
meshConfig:
  defaultConfig:
    proxyMetadata:
    # Enable Istio agent to handle DNS requests for known hosts
    # Unknown hosts will automatically be resolved using upstream dns servers in resolv.conf
      ISTIO_META_DNS_CAPTURE: "true"

global:
  meshID: mesh1
  multiCluster:
    # Set to true to connect two kubernetes clusters via their respective
    # ingressgateway services when pods in each cluster cannot directly
    # talk to one another. All clusters should be using Istio mTLS and must
    # have a shared root CA for this model to work.
    enabled: true
    # Should be set to the name of the cluster this installation will run in. This is required for sidecar injection
    # to properly label proxies
    clusterName: "cluster2"
    globalDomainSuffix: "local"
    # Enable envoy filter to translate `globalDomainSuffix` to cluster local suffix for cross cluster communication
    includeEnvoyFilter: false
  network: network2

Now we will upgrade/install the istiod chart with the newly added configuration from the override file. As you can see, I have a separate override file for each cluster.

helm upgrade istiod manifests/charts/istio-control/istio-discovery -n istio-system --values sample_overrides-aspenmesh_2.yaml

helm upgrade istiod manifests/charts/istio-control/istio-discovery -n istio-system --values sample_overrides-aspenmesh.yaml
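
Because both releases are named istiod, it is easy to run an upgrade against the wrong cluster. One way to be explicit is to pass Helm’s --kube-context flag; this is a sketch that assumes the contexts defined earlier and that each override file was written for the cluster shown:

# Pair each override file with the cluster it was written for
helm upgrade istiod manifests/charts/istio-control/istio-discovery -n istio-system --values sample_overrides-aspenmesh.yaml --kube-context "${CTX_CLUSTER1}"

helm upgrade istiod manifests/charts/istio-control/istio-discovery -n istio-system --values sample_overrides-aspenmesh_2.yaml --kube-context "${CTX_CLUSTER2}"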

Check the pods in the istio-system namespace to confirm they are all in a running state. Be sure to delete all of your application pods in the default namespace so the new configuration takes effect when the replacement pods are spun up. You can also check that the root cert used by pods in each cluster is the same. I am using pods from the bookinfo sample application.
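
For those first two checks, here is a minimal sketch, assuming the contexts defined earlier and that your applications run in the default namespace (a rolling restart of the deployments is equivalent to deleting the pods):

kubectl get pods -n istio-system --context="${CTX_CLUSTER1}"
kubectl get pods -n istio-system --context="${CTX_CLUSTER2}"

# Restart application workloads so the new sidecar configuration is picked up
kubectl rollout restart deployment -n default --context="${CTX_CLUSTER1}"
kubectl rollout restart deployment -n default --context="${CTX_CLUSTER2}"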

istioctl pc secrets details-v1-79f774bdb9-pqpjw -o json | jq '[.dynamicActiveSecrets[] | select(.name == "ROOTCA")][0].secret.validationContext.trustedCa.inlineBytes' -r | base64 -d | openssl x509 -noout -text | md5

istioctl pc secrets details-v1-79c697d759-tw2l7 -o json | jq '[.dynamicActiveSecrets[] | select(.name == "ROOTCA")][0].secret.validationContext.trustedCa.inlineBytes' -r | base64 -d | openssl x509 -noout -text |md5

Once istiod is upgraded, we will create the gateway used for communication between the two clusters by installing an east-west gateway. Use the configuration below to create a yaml file for each cluster and install it with Helm. I have created two yaml files, cluster1_gateway_config.yaml and cluster2_gateway_config.yaml, which will be used with their respective clusters.

For Cluster 1

#This can be on separate override file as we will install a custom IGW
gateways:
  istio-ingressgateway:
    name: istio-eastwestgateway
    labels:
      app: istio-eastwestgateway
      istio: eastwestgateway
      topology.istio.io/network: network1
    ports:
    ## You can add custom gateway ports in user values overrides, but it must include those ports since helm replaces.
    # Note that AWS ELB will by default perform health checks on the first port
    # on this list. Setting this to the health check port will ensure that health
    # checks always work. https://github.com/istio/istio/issues/12503
    - port: 15021
      targetPort: 15021
      name: status-port
      protocol: TCP
    - port: 80
      targetPort: 8080
      name: http2
      protocol: TCP
    - port: 443
      targetPort: 8443
      name: https
      protocol: TCP
    - port: 15012
      targetPort: 15012
      name: tcp-istiod
      protocol: TCP
    # This is the port where sni routing happens
    - port: 15443
      targetPort: 15443
      name: tls
      protocol: TCP
    - name: tls-webhook
      port: 15017
      targetPort: 15017
    env:
      # A gateway with this mode ensures that pilot generates an additional
      # set of clusters for internal services but without Istio mTLS, to
      # enable cross cluster routing.
      ISTIO_META_ROUTER_MODE: "sni-dnat"
      ISTIO_META_REQUESTED_NETWORK_VIEW: "network1"
    serviceAnnotations:
      service.beta.kubernetes.io/aws-load-balancer-type: nlb

global:
  meshID: mesh1
  multiCluster:
    # Set to true to connect two kubernetes clusters via their respective
    # ingressgateway services when pods in each cluster cannot directly
    # talk to one another. All clusters should be using Istio mTLS and must
    # have a shared root CA for this model to work.
    enabled: true
    # Should be set to the name of the cluster this installation will run in. This is required for sidecar injection
    # to properly label proxies
    clusterName: "cluster1"
    globalDomainSuffix: "local"
    # Enable envoy filter to translate `globalDomainSuffix` to cluster local suffix for cross cluster communication
    includeEnvoyFilter: false
  network: network1

For Cluster 2

gateways:
  istio-ingressgateway:
    name: istio-eastwestgateway
    labels:
      app: istio-eastwestgateway
      istio: eastwestgateway
      topology.istio.io/network: network2
    ports:
    ## You can add custom gateway ports in user values overrides, but it must include those ports since helm replaces.
    # Note that AWS ELB will by default perform health checks on the first port
    # on this list. Setting this to the health check port will ensure that health
    # checks always work. https://github.com/istio/istio/issues/12503
    - port: 15021
      targetPort: 15021
      name: status-port
      protocol: TCP
    - port: 80
      targetPort: 8080
      name: http2
      protocol: TCP
    - port: 443
      targetPort: 8443
      name: https
      protocol: TCP
    - port: 15012
      targetPort: 15012
      name: tcp-istiod
      protocol: TCP
    # This is the port where sni routing happens
    - port: 15443
      targetPort: 15443
      name: tls
      protocol: TCP
    - name: tls-webhook
      port: 15017
      targetPort: 15017
    env:
      # A gateway with this mode ensures that pilot generates an additional
      # set of clusters for internal services but without Istio mTLS, to
      # enable cross cluster routing.
      ISTIO_META_ROUTER_MODE: "sni-dnat"
      ISTIO_META_REQUESTED_NETWORK_VIEW: "network2"
    serviceAnnotations:
      service.beta.kubernetes.io/aws-load-balancer-type: nlb
global:
  meshID: mesh1
  multiCluster:
    # Set to true to connect two kubernetes clusters via their respective
    # ingressgateway services when pods in each cluster cannot directly
    # talk to one another. All clusters should be using Istio mTLS and must
    # have a shared root CA for this model to work.
    enabled: true
    # Should be set to the name of the cluster this installation will run in. This is required for sidecar injection
    # to properly label proxies
    clusterName: "cluster2"
    globalDomainSuffix: "local"
    # Enable envoy filter to translate `globalDomainSuffix` to cluster local suffix for cross cluster communication
    includeEnvoyFilter: false
  network: network2

helm install istio-eastwestgateway manifests/charts/gateways/istio-ingress --namespace istio-system --values cluster1_gateway_config.yaml

helm install istio-eastwestgateway manifests/charts/gateways/istio-ingress --namespace istio-system --values cluster2_gateway_config.yaml

After adding the new east-west gateway, you will get an east-west gateway pod deployed in the istio-system namespace, along with a service that creates the Network Load Balancer specified in the annotations. Until the Istio issue “Multi-Cluster/Multi-Network: cannot use a hostname-based gateway for east-west traffic” (istio/istio #29359) is fixed, you will need to resolve the IP address of each east-west gateway’s NLB and patch it into the service as spec.externalIPs in both of your clusters. This workaround is not ideal, but it is required for now.

k get svc -n istio-system istio-eastwestgateway
NAME                    TYPE           CLUSTER-IP      EXTERNAL-IP                                                                                  PORT(S)                                                                                      AGE
istio-eastwestgateway   LoadBalancer   100.71.211.32   a927e6<TRUNCATED>.elb.us-west-2.amazonaws.com 15021:32138/TCP,80:30420/TCP,443:31450/TCP,15012:30150/TCP,15443:30476/TCP,15017:32335/TCP   8d

nslookup a927e6<TRUNCATED>.elb.us-west-2.amazonaws.com
Server:        172.23.241.180
Address:    172.23.241.180#53
Non-authoritative answer:
Name:    a927e6<TRUNCATED>.elb.us-west-2.amazonaws.com
Address: 35.X.X.X

kubectl patch svc -n istio-system istio-eastwestgateway -p '{"spec":{"externalIPs": ["35.X.X.X"]}}'

k get svc -n istio-system istio-eastwestgateway
NAME                    TYPE           CLUSTER-IP      EXTERNAL-IP                                                                                  PORT(S)                                                                                      AGE
istio-eastwestgateway   LoadBalancer   100.71.211.32   a927e6<TRUNCATED>.elb.us-west-2.amazonaws.com,35.X.X.X   15021:32138/TCP,80:30420/TCP,443:31450/TCP,15012:30150/TCP,15443:30476/TCP,15017:32335/TCP   8d

Now that the gateways are configured for cross-cluster communication, you will have to make sure the API server of each cluster is able to talk to the other cluster. In AWS, you can do this by making sure the API server instances are reachable from one another, for example by adding specific rules to their security groups. We will then need to create a secret in cluster 1 that provides access to cluster 2’s API server, and vice versa, for endpoint discovery.
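
For the security-group step, here is a minimal AWS CLI sketch; the group IDs are placeholders (with kOps these would be each cluster’s master/API security groups), and you would add the mirror-image rule for the other direction as well:

# Allow cluster 2's masters to reach cluster 1's API server on port 443
aws ec2 authorize-security-group-ingress \
  --group-id sg-cluster1-masters \
  --protocol tcp \
  --port 443 \
  --source-group sg-cluster2-masters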

# Install a remote secret in cluster 2 that provides access to cluster 1's API server
istioctl x create-remote-secret --context="${CTX_CLUSTER1}" --name=cluster1 | kubectl apply -f - --context="${CTX_CLUSTER2}"

# Install a remote secret in cluster 1 that provides access to cluster 2's API server
istioctl x create-remote-secret --context="${CTX_CLUSTER2}" --name=cluster2 | kubectl apply -f - --context="${CTX_CLUSTER1}"

At this stage, pilot (which is bundled into the istiod binary) should have the new configuration, and when you tail the logs of the istiod pod you should see the log message “Number of remote cluster: 1”. With this version, you will also need to edit the east-west ingress gateway in the istio-system namespace that we created above, because the selector label and the annotation added via the Helm chart are different than expected: they show “istio: ingressgateway” but should be “istio: eastwestgateway”. You can now create pods in each cluster and verify that everything is working as expected. Here is how the east-west gateway should look:

apiVersion: networking.istio.io/v1beta1
kind: Gateway
metadata:
  annotations:
    meta.helm.sh/release-name: istio-eastwestgateway
    meta.helm.sh/release-namespace: istio-system
  creationTimestamp: "2021-05-13T01:56:50Z"
  generation: 2
  labels:
    app: istio-eastwestgateway
    app.kubernetes.io/managed-by: Helm
    install.operator.istio.io/owning-resource: unknown
    istio: eastwestgateway
    istio.io/rev: default
    operator.istio.io/component: IngressGateways
    release: istio-eastwestgateway
    topology.istio.io/network: network2
  name: istio-multicluster-ingressgateway
  namespace: istio-system
  resourceVersion: "6777467"
  selfLink: /apis/networking.istio.io/v1beta1/namespaces/istio-system/gateways/istio-multicluster-ingressgateway
  uid: 618b2b5b-a2bb-4b37-a4a1-7f5ab7ef03d4
spec:
  selector:
    istio: eastwestgateway
  servers:
  - hosts:
    - '*.local'
    port:
      name: tls
      number: 15443
      protocol: TLS
    tls:
      mode: AUTO_PASSTHROUGH
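
To verify that cross-cluster discovery and routing work end to end, here is a minimal sketch following the upstream Istio multicluster verification approach; it assumes the sleep and helloworld samples are deployed to a sample namespace in both clusters (those names are illustrative, not part of this setup):

# Confirm istiod has registered the remote cluster
kubectl logs -n istio-system deploy/istiod --context="${CTX_CLUSTER1}" | grep -i "remote cluster"

# From a sleep pod in cluster 1, call the helloworld service; if cross-cluster
# routing works, responses should come from the helloworld versions in both clusters
kubectl exec --context="${CTX_CLUSTER1}" -n sample -c sleep deploy/sleep -- curl -sS helloworld.sample:5000/hello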



Improve your application with service mesh

Improving Your Application with Service Mesh

Engineering + Technology = Uptime 

Have you come across the term “application value” lately? Software-first organizations are using it as a new form of currency. Businesses delivering a product or service to their customers through an application understand the growing importance of that application’s security, reliability and feature velocity. And, as the applications people use become increasingly important to enterprises, so do engineering teams and the right tools. 

The Right People for the Job: Efficient Engineering Teams 

Access to engineering talent is now more important to some companies than access to capital; 61% of executives consider this a potential threat to their business. With the average developer spending more than 17 hours each week dealing with maintenance issues such as debugging and refactoring, plus approximately four hours a week on “bad code” (representing nearly $85 billion worldwide in lost opportunity cost annually), the need to drive business value with applications only grows. And who can help solve these puzzles? The right engineering team, in combination with the right technologies and tools. For the piece of the puzzle that can be solved by your engineering team, enterprises have two options as customer demands on applications increase:  

  1. Increase the size and cost of engineering teams, or  
  2. Increase your engineering efficiency.  

Couple the need to increase your engineering team’s efficiency with the challenge of growing revenue in increasingly competitive, low-margin businesses, and the importance of driving value through applications is top of mind for any business. One way to help make your team more efficient is by providing the right technologies and tools. 

The Right Technology for the Job: Microservices and Service Mesh 

Using microservices architectures allows enterprises to deliver new features to customers more quickly, keeping them happy and providing them with more value over time. In addition, with microservices, businesses can more easily keep pace with the competition in their space through better application scalability, resiliency and agility. Of course, as with any shift in technology, there can be new challenges.  

One challenge our customers sometimes face is difficulty debugging or resolving problems within these microservices environments. It can be challenging to fix issues fast, especially when cascading failures cause your users to have a bad experience with your application. That’s where a service mesh can help. 

A service mesh provides ways to see, identify, trace and log errors when they occur and to pinpoint their sources. It brings all of your data together into a single source of truth, removing error-prone processes and giving you fast, reliable information about downtime, failures and outages. More uptime means happy users and more revenue, along with the agility and stability you need for a competitive edge. 

Increasing Your Application Value  

Service mesh allows engineering teams to address many issues, but especially these three critical areas: 

  • Proactive issue detection, quick incident response, and workflows that accelerate fixing issues 
  • A unified source of multi-dimensional insights into application and infrastructure health and performance that provides context about the entire software system 
  • Line of sight into weak points in environments, enabling engineering teams to build more resilient systems in the future  

If you or your team are running Kubernetes-based applications at scale and are seeing the advantages, but know you could get more value out of them by increasing your engineering efficiency and uptime for your applications’ users, it’s probably time to check out a service mesh. You can reach out to the Aspen Mesh team at hello@aspenmesh.io to learn how to get started or how best to integrate a service mesh into your existing stack. Or you can get started yourself with a 30-day free trial of Aspen Mesh. 


istiocon 9 trends

Top 9 Takeaways from IstioCon 2021

At the beginning of last year, we predicted the top three developments around service mesh in 2020 would be:

  1. A quickly growing need for service mesh
  2. Istio will be hard to beat
  3. Core service mesh use cases will emerge that will be used as models for the next wave of adopters

And we were right about all three, as evidenced by what we learned at IstioCon.

As a new community-led event, IstioCon 2021 provided the first organized opportunity for Istio’s community members to gather together on a large, worldwide scale, to present, learn and discuss the many features and benefits of the Istio service mesh. And this event was a resounding success.

With over 4,000 attendees — in its first year, and as a virtual event — IstioCon attendance exceeded expectations by multiples. The event showcased the lessons learned from running Istio in production, first-hand experiences from the Istio community, and featured maintainers from across the Istio ecosystem including Lin Sun, John Howard, Christian Posta, Neeraj Poddar, and more. With sessions presented across five days in English, as well as keynotes and sessions in Chinese, this was indeed a worldwide effort. It is well-known that the Istio community reaches far and wide, but it was fantastic to see that so many people interested in, considering, and even using Istio in production at scale were ready to show up and share.

But apart from the outstanding response of the Istio community, we were particularly excited to dig into what people are really using this service mesh for and how they’re interacting with it. So, we’ve pulled together the below curated list of top Istio trends, hot topics, and our top three list of sessions you don’t want to miss.

Top 3 Istio Service Mesh Trends to Watch

After watching each session (so you don’t have to!), we’ve distilled the top three service mesh and Istio industry takeaways that came out of IstioCon that you should keep on your radar.

1. Istio is production-ready. No longer just a shiny new object, this nascent technology has transformed over the past few years from a new infrastructure technology into the microservices management technology that people are using, now, in production and at scale at real companies. We saw insightful user story presentations from T-Mobile, Airbnb, eBay, Salesforce, FICO, and more.

2. Istio is more versatile than you thought. Did you know that Istio is being used right now by users and companies to manage everything from user-facing applications like Airbnb to behind-the-scenes infrastructure like running 5G?

3. Istio and Kubernetes have a lot in common. There are lots of similarities between Istio and Kubernetes in terms of how these technologies have developed, and how they are being adopted. It’s well known that Kubernetes is “the defacto standard for cloud native applications.” Istio is being called ”the most popular service mesh” according to the CNCF annual user survey. But more than this, the two are growing closer together in terms of the technologies themselves. We look forward to the growth of both technologies.

Top 3 Hot Topics

In addition to higher level industry trends, there were many other hot topics that surfaced as part of this conference. From security to Wasm, multicluster, integrations, policies, ORAS, and more, there is a lot going on in the service mesh marketplace that many folks may not have realized. Here are the three hot topics we’d like you to know about:

1. Multicluster. You can configure a single mesh to include multiple clusters. Using a multicluster deployment within a single mesh affords capabilities beyond those of a single-cluster deployment, including fault isolation and failover, location-aware routing, various control plane models, and team or project isolation. It was indeed a hot topic at IstioCon, with an entire workshop devoted to Istio multicluster, plus two additional individual sessions and a dedicated office-hours session.

2. Wasm. WebAssembly (Wasm) is a sandboxing technology that can be used to extend the Istio proxy (Envoy). The Proxy-Wasm sandbox API replaces Mixer as the primary extension mechanism in Istio. Over the past year, Wasm has come further to the forefront in terms of interest, as seen here by garnering two sessions plus its own office-hours session.

3. Security. Let’s face it, we’re all concerned about security, and with good reason. Istio has decided to face security challenges head on, and while not exactly a new topic, it’s one worth reiterating. The Istio Product Security Working Group had a session, plus we saw two more sessions featuring security as a headliner, and a dedicated office-hours session. 

Side note: our list had a tie with one more hot topic: debugging Istio. If you get a chance, check out the three recorded sessions on debugging as well.

Top 3 Sessions You Will Want to Watch On-demand

Not everyone has time to watch a conference for five days in a row. And that’s ok. There are about 77 sessions we wish you could watch, but we’ve also identified the top three we think you’ll get the most out of. Check these out:

1. Using Istio to Build the Next Generation 5G Platform. As the most-watched session at this event, we have to start here. In this session, Aspen Mesh’s Co-founder and Chief Architect Neeraj Poddar and David Lenrow, Senior Principal Cloud Security Architect at Verizon, covered what 5G is and why it matters, architecture options with Istio, platform requirements, security, and more.

2. User story from Salesforce - The Salesforce Service Mesh: Our Istio Journey. In this session, Salesforce Software Architect Pratima Nambiar talked us through their background around why they needed a service mesh, their initial implementation, Istio’s value, progressive adoption of Istio, and features they are watching and expect to adopt. 

3. User story from eBay - Istio at Scale: How eBay is Building a Massive Multitenant Service Mesh Using Istio. In this session, Sudheendra Murthy covered eBay’s story, from their applications deployment to service mesh journey, scale testing, and future direction.

What’s Next for Istio?

We were excited to be part of this year’s IstioCon, and it was wonderful to see the Istio community come together for this new event. As our team members have been key contributors to the Istio project over the past few years, we’ve had a front row seat at the growth of the project itself along with the community.

To learn more about what the Istio project has coming up on the horizon, check out this project roadmap session. We’re looking forward to the continued growth of this open source technology, so that more companies — and people — can benefit from what it has to offer.


doubling down on istio

Doubling Down On Istio

Good startups believe deeply that something is true about the future, and organize around it.

When we founded Aspen Mesh as a startup inside of F5, my co-founders and I believed these things about the future:

  1. App developers would accelerate their pace of innovation by modularizing and building APIs between modules packaged in containers.
  2. Kubernetes APIs would become the lingua franca for describing app and infrastructure deployments and Kubernetes would be the best platform for those APIs.
  3. The most important requirement for accelerating is to preserve control without hindering modularity, and that’s best accomplished as close to the app as possible.

We built Aspen Mesh to address item 3. If you boil down reams of pitch decks, board-of-directors updates, marketing and design docs dating back to summer of 2017, that's it. That's what we believe, and I still think we're right.

Aspen Mesh is a service mesh company, and the lowest levels of our product are the open-source service mesh Istio. Istio has plenty of fans and detractors; there are plenty of legitimate gripes and more than a fair share of uncertainty and doubt (as is the case with most emerging technologies). With that in mind, I want to share why we selected Istio and Envoy for Aspen Mesh, and why we believe more strongly than ever that they're the best foundation to build on.

 

Why a service mesh at all?

A service mesh is about connecting microservices. The acceleration we're talking about relies on applications that are built out of small units (predominantly containers) that can be developed and owned by a single team. Stitching these units into an overall application requires APIs between them. APIs are the contract. Service Mesh measures and assists contract compliance. 

There's more to it than reading the 12-factor app. All these microservices have to communicate effectively to actually solve a user's problem. Communication over HTTP APIs is well supported in every language and environment, so it has never been easier to get started. However, don't let the simplicity delude you: you are now building a distributed system. 

We don't believe the right approach is to demand deep networking and infrastructure expertise from everyone who wants to write a line of code.  You trade away the acceleration enabled by containers for an endless stream of low-level networking challenges (as much as we love that stuff, our users do not). Instead, you should preserve control by packaging all that expertise into a technology that lives as close to the application as possible. For Kubernetes-based applications, this is a common communication enhancement layer called a service mesh.

How close can you get? Today, we see users having the most success with Istio's sidecar container model. We forecasted that in 2017, but we believe the concept ("common enhancement near the app") will outlive the technical details.

This common layer should observe all the communication the app is making; it should secure that communication and it should handle the burdens of discovery, routing, version translation and general interoperability. The service mesh simplifies and creates uniformity: there's one metric for "HTTP 200 OK rate", and it's measured, normalized and stored the same way for every app. Your app teams don't have to write that code over and over again, and they don't have to become experts in retry storms or circuit breakers. Your app teams are unburdened of infrastructure concerns so they can focus on the business problem that needs solving.  This is true whether they write their apps in Ruby, Python, node.js, Go, Java or anything else.

That's what a service mesh is: a communication enhancement layer that lives as close to your microservice as possible, providing a common approach to controlling communication over APIs.

 

Why Istio?

Just because you need a service mesh to secure and connect your microservices doesn't mean Envoy and Istio are the only choice.  There are many options in the market when it comes to service mesh, and the market still seems to be expanding rather than contracting. Even with all the choices out there, we still think Istio and Envoy are the best choice.  Here's why.

We launched Aspen Mesh after learning some lessons with a precursor product. We took what we learned, re-evaluated some of our assumptions and reconsidered the biggest problems development teams using containers were facing. It was clear that users didn't have a handle on managing the traffic between microservices, and we saw that few were using microservices in earnest yet, so we realized this problem would only get more urgent as microservices adoption increased. 

So, in 2017 we asked: what would characterize the technology that solved that problem?

We compared our own nascent work with other purpose-built meshes like Linkerd (in the 1.0 Scala-based implementation days) and Istio, and non-mesh proxies like NGINX and HAProxy. This was long before service mesh options like Consul, Maesh, Kuma and OSM existed. Here's what we thought was important:

  • Kubernetes First: Kubernetes is the best place to position a service mesh close to your microservice. The architecture should support VMs, but it should serve Kubernetes first.
  • Sidecar "bookend" Proxy First: To truly offload responsibility to the mesh, you need a datapath element as close as possible to the client and server.
  • Kubernetes-style APIs are Key: Configuration APIs are a key cost for users.  Human engineering time is expensive. Organizations are judicious about what APIs they ask their teams to learn. We believe Kubernetes API design and mechanics got it right. If your mesh is deployed in Kubernetes, your API needs to look and feel like Kubernetes.
  • Open Source Fundamentals: Customers will want to know that they are putting sustainable and durable technology at the core of their architecture. They don't want a technical dead-end. A vibrant open source community ensures this via public roadmaps, collaboration, public security audits and source code transparency.
  • Latency and Efficiency: These are performance keys that are more important than total throughput for modern applications.

As I look back at our documented thoughts, I see other concerns, too (p99 latency in languages with dynamic memory management, layer 7 programmability). But the above were the key items that we were willing to bet on. So it became clear that we had to place our bet on Istio and Envoy. 

Today, most of that list seems obvious. But in 2017, Kubernetes hadn’t quite won. We were still supporting customers on Mesos and Docker Datacenter. The need for service mesh as a technology pattern was becoming more obvious, but back then Istio was novel - not mainstream. 

I'm feeling very good about our bets on Istio and Envoy. There have been growing pains to be sure. When I survey the state of these projects now, I see mature, but not stagnant, open source communities.  There's a plethora of service mesh choices, so the pattern is established.  Moreover the continued prevalence of Istio, even with so many other choices, convinces me that we got that part right.

 

But what about...?

While Istio and Envoy are a great fit for all those bullets, there are certainly additional considerations. As with most concerns in a nascent market, some are legitimate and some are merely noise. I'd like to address some of the most common that I hear from conversations with users.

"I hear the control plane is too complex" - We hear this one often. It’s largely a remnant of past versions of Istio that have been re-architected to provide something much simpler, but there's always more to do. We're always trying to simplify. The two major public steps that Istio has taken to remedy this include removing standalone Mixer, and co-locating several control plane functions into a single container named istiod.

However, there's some stuff going on behind the curtains that doesn't get enough attention. Kubernetes makes it easy to deploy multiple containers. Personally, I suspect the root of this complaint wasn't so much "there are four running containers when I install" but "Every time I upgrade or configure this thing, I have to know way too many details."  And that is fixed by attention to quality and user-focus. Istio has made enormous strides in this area. 

"Too many CRDs" - We've never had an actual user of ours take issue with a CRD count (the set of API objects it's possible to define). However, it's great to minimize the number of API objects you may have to touch to get your application running. Stealing a paraphrasing of Einstein, we want to make it as simple as possible, but no simpler. The reality: Istio drastically reduced the CRD count with new telemetry integration models (from "dozens" down to 23, with only a handful involved in routine app policies). And Aspen Mesh offers a take on making it even simpler with features like SecureIngress that map CRDs to personas - each persona only needs to touch 1 custom resource to expose an app via the service mesh.

"Envoy is a resource hog" - Performance measurement is a delicate art. The first thing to check is that wherever you're getting your info from has properly configured the system-under-measurement.  Istio provides careful advice and their own measurements here.  Expect latency additions in the single-digit-millisecond range, knowing that you can opt parts of your application out that can't tolerate even that. Also remember that Envoy is doing work, so some CPU and memory consumption should be considered a shift or offload rather than an addition. Most recent versions of Istio do not have significantly more overhead than other service meshes, but Istio does provide twice as many feature, while also being available in or integrating with many more tools and products in the market. 

"Istio is only for really complicated apps” - Sure. Don’t use Istio if you are only concerned with a single cluster and want to offload one thing to the service mesh. People move to Kubernetes specifically because they want to run several different things. If you've got a Money-Making-Monolith, it makes sense to leave it right where it is in a lot of cases. There are also situations where ingress or an API gateway is all you need. But if you've got multiple apps, multiple clusters or multiple app teams then Kubernetes is a great fit, and so is a service mesh, especially as you start to run things at greater scale.

In scenarios where you need a service mesh, it makes sense to use the service mesh that gives you a full suite of features. A nice thing about Istio is you can consume it piecemeal - it does not have to be implemented all at once. So you only need mTLS and tracing now? Perfect. You can add mTLS and tracing now and have the option to add metrics, canary, traffic shifting, ingress, RBAC, etc. when you need it.

We’re excited to be on the Istio journey and look forward to continuing to work with the open source community and project to continue advancing service mesh adoption and use cases. If you have any particular question I didn’t cover, feel free to reach out to me at @notthatjenkins. And I'm always happy to chat about the best way to get started on or continue with service mesh implementation. 


steering future of istio

Steering The Future Of Istio

I’m honored to have been chosen by the Istio community to serve on the Istio Steering Committee along with Christian Posta, Zack Butcher and Zhonghu Xu. I have been fortunate to contribute to the Istio project for nearly three years and am excited by the huge strides the project has made in solving key challenges that organizations face as they shift to cloud-native architecture. 

Maybe what’s most exciting is the future direction of the project. The core Istio community realizes and advocates that innovation in Open Source doesn't stop with technology - it’s just the starting point. New and innovative ways of growing the community include making contributions easier, Working Group meetings more accessible and community meetings an open platform for end users to give their feedback. As a member of the steering committee, one of my main goals will be to make it easier for a diverse group of people to more easily contribute to the project.

Sharing my personal journey: when I started contributing to Istio, I found it intimidating to present rough ideas or proposals in an open Networking WG meeting filled with experts and leaders from Google & IBM (even though they were very welcoming). I understand how difficult it can be to get started contributing to a new community, so I want to ensure the Working Group and community meetings are a place for end users and new contributors to share ideas openly, and also to learn from industry experts. I will focus on increasing participation from diverse groups by working to make Istio the most welcoming community possible. In this vein, it will be important for the Steering Committee to further define and enforce a code of conduct, creating a safe place for all contributors.

The Istio community’s effort towards increasing open governance by ensuring no single organization has control over the future of the project has certainly been a step in the right direction with the new makeup of the steering committee. I look forward to continuing work in this area to make Istio the most open project it can be. 

Outside of code contributions, marketing and brand identity are critically important aspects of any open source project. It will be important to encourage contributions from marketing and business leaders and to ensure we recognize non-technical contributions. Addressing this is less straightforward than encouraging and crediting code commits, but a diverse, vendor-neutral marketing team in open source can create powerful ways to reach users and drive adoption, which is critical to the success of any open source project. Recent user empathy sessions and user survey forms are a great starting point, but our ability to put these learnings into action and adapt as a community will be a key driver in growing project participation.

Last, but definitely not least, I’m keen to leverage my experience and feedback from years of work with Aspen Mesh customers and broad enterprise experience to make Istio a more robust and production-ready project. 

In this vein, my fellow Aspen Mesher Jacob Delgado has worked tirelessly for many months contributing to Istio. As a result of his contributions, he has been named a co-lead for the Istio Product Security Working Group. Jacob has been instrumental in championing security best practices for the project and has also helped responsibly remediate several CVEs this year. I’m excited to see more contributors like Jacob make significant improvements to the project.

I'm humbled by the support of the community members who voted in the steering elections and chose such a talented team to shepherd Istio forward. I look forward to working with all the existing, and hopefully many new, members of the Istio community! You can always reach out to me through email, Twitter or Istio Slack for any community, technical or governance matter, or if you just want to chat about a great idea you have.


What Are Companies Using Service Mesh For?

We recently worked with 451 Research to identify current trends in the service mesh space. Together, we identified some key service mesh trends and patterns around how companies are adopting service mesh, and emerging use cases that are driving that adoption. Factors driving adoption include how service mesh automates and bolsters security, and a recognition of service mesh observability capabilities to ease debugging and decrease Mean Time To Resolution (MTTR). Check out this video for more from 451 Research's Senior Analyst in Application and Infrastructure Performance, Nancy Gohring, on this topic:

Who’s Using Service Mesh 

According to data and insights gathered by 451 Research, service mesh already has significant momentum, even though it is a young technology. Results from the Voice of the Enterprise: DevOps, Workloads & Key Projects 2020 survey tell us that 16% of respondents had adopted service mesh across their entire IT organizations, and 20% had adopted service mesh at the team level. Outside of those numbers, 38% of respondents also reported that they are in trials or planning to use service mesh in the future. As Kubernetes dominates the microservices landscape, the need for a service mesh to manage layer 7 communication is becoming increasingly clear. 

451 Research Service Mesh Adoption

In tandem with this growing adoption trend, the technology itself is expanding quickly. While the top driver of service mesh adoption continues to be supporting traffic management, service mesh provides many additional capabilities beyond controlling traffic. 451 found that key new capabilities the technology provides include greatly enhanced security as well as increased observability into microservices.

Service Mesh and Security

Many organizations—particularly those in highly regulated industries such as healthcare and financial services—need to comply with very demanding security and regulatory requirements. A service mesh can be used to enforce or enhance important security and compliance policies more consistently, and across teams, at an organization-wide level. A service mesh can be used to:

  • Apply security policies to all traffic at ingress, and encrypt traffic traveling between services using mTLS (a brief example follows this list)
  • Add Zero-Trust networking
  • Govern certificate management for authenticating identity
  • Enforce level of least privilege with role-based access control (RBAC)
  • Manage policies consistently, regardless of protocols and runtimes 
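
As a concrete illustration of the mTLS item above, here is a minimal sketch of an Istio PeerAuthentication policy that enforces strict mTLS; applying it in the istio-system root namespace makes it the mesh-wide default:

apiVersion: security.istio.io/v1beta1
kind: PeerAuthentication
metadata:
  name: default
  namespace: istio-system
spec:
  mtls:
    mode: STRICT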

These capabilities are particularly important for complex microservices deployments, and allow DevOps teams to ensure a strong security posture while running in production at global scale. 

Observability and Turning Your Data into Intelligence

In addition to helping enterprises improve their security posture, a service mesh also greatly improves observability through traces and metrics that allow operators to quickly find the root cause of any failure and ensure resilient applications. Enabling the rapid resolution of performance problems allows DevOps teams to reduce mean time to resolution (MTTR) and optimize engineering efficiency. 

The broader market trends around observability and advanced analytics with open source technologies are also key to the success of companies adopting service mesh. There are challenges around managing microservices environments, and teams need better ways of identifying the sources of performance issues in order to resolve problems faster and more efficiently. Complex microservices-based applications generate very large amounts of data. Many open source projects are addressing this by making it easier for users to collect data from these environments, and advancements in analytics tools are enabling users to extract the signal from the noise, quickly directing users to the source of performance problems. 

Overcoming this challenge is why we created Aspen Mesh Rapid Resolve. It allows users to see any configuration or policy changes made within Kubernetes clusters, which are almost always the cause of failures. The Rapid Resolve timeline view makes it simple for operators to look back in time and pinpoint any changes that resulted in performance degradation. 

Aspen Mesh Rapid Resolve

This enables Aspen Mesh users to identify root causes, report actions and apply fixing configurations all in one place. For example, the Rapid Resolve suite offers many new features including:

  • Restore: a smarter, machine-assisted way to effectively reduce the set of things an operator or developer has to look through to find the root cause of failure in their environment. Root causing in distributed architectures is hard. Aspen Mesh Restore immediately alerts engineers to any performance outside acceptable thresholds and makes it obvious where any configuration, application or infrastructure changes occurred that are likely to be breaking changes.
  • Replay: a one-stop shop for application troubleshooting and reducing time to recovery. Aspen Mesh Replay gives you the current and the past view of your cluster state, including microservices connectivity, traffic and service health, and relevant events like configuration changes and alerts along the way. This view is great for understanding and diagnosing cascading failures. You can easily roll back in time and detect where a failure started. It's also a good tool for sharing information in larger groups where you can track the health of your cluster visually over time.

The Future of Service Mesh

Companies strive for stability with agility, which allows them to meet the market and users where they are, and thrive even in an uncertain marketplace. According to 451 Research,

“Businesses are employing containers, Kubernetes and microservices as tools that allow them to more quickly respond to customer demands and competitive threats. However, these technologies introduce new and potentially significant management challenges. Advanced organizations have turned to service mesh to help solve some of these problems. Service mesh technology can remove infrastructure burdens from developers, enabling them to focus on creating valuable application features rather than managing the mechanics of microservices communications. But managing the communications layer isn’t the only benefit a service mesh brings to the table. Increasingly, users are recognizing the role service meshes can play in collecting and analyzing important observability data, as well as their ability to support security requirements.”

The adoption of containers, Kubernetes and service mesh is continuing to grow, and both security and observability will be key drivers that increase service mesh adoption in the coming years.