On Silly Animals and Gray Codes

I love Information Theory. This is a random rumination on surprise.  

Helm (v2) is a templating engine and release manager for Kubernetes.  Basically it lets you leverage the combined knowledge of experts on how you should configure container software, but still gives you nerd knobs you can tweak as needed. When Helm deploys software, it's called a release. You can name your releases, like ingress-controller-for-prod.  You'll use this name later: "Hey, Helm, how is ingress-controller-for-prod doing?" or "Hey, Helm, delete all the stuff you made for ingress-controller-for-prod."

If you don't name a release, Helm will make up a release name for you. It's a combination of an adjective and an animal:

"Monicker ships with a couple of word lists that were written and approved by a group of giggling school children (and their dad). We built a lighthearted list based on animals and descriptive words (mostly adjectives)."

So if you don't pick a name, Helm will pick one for you. You might get jaunty ferret or gauche octopus. Helm could have decided to pick unique identifiers, say UUIDs, so instead of jaunty ferret you get 9fa485b1-6e8b-47c4-baa1-3923394382a5 or e0c2def3-bc94-44ff-b702-985d4eb38ded. To Helm itself, the UUIDs would be fine. To the humans, though, I argue 9fa485b1-6e8b-47c4-baa1-3923394382a5 is a bad option because our brains aren't good handlers of long strings like 9fa485b1-6e8b-47c4-baa1-3932394382a5; it's hard to say 9fa485b1-6e8b-47e4-baa1-3923394382a5 and you're not even going to notice that I've actually subtly mixed up digits in 9fa485b1-6e8b-47c4-baa1-3923393482a5 through this entire paragraph.  But if I had mixed up jaunty ferret and jumpy ferret you at least stand a chance. This is true even though the bitwise difference between the inputs that generated jaunty ferret and jumpy ferret is actually smaller than my UUID tricks.

Humans are awful at handling arbitrarily long numbers. We can't fake them well. We get dazzled by them. We are miserable at comparing even short numbers; sometimes people die as a result.

So, if you're building identifiers into a system, you should consider if those are going to be seen by humans. And if so, I think you should make those identifiers suitable for humans: distinctive and pronounceable.
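As a rough illustration of the idea, here is a minimal Python sketch of an adjective-animal name generator. The word lists and function names are made up for this example; they are not Helm's or Monicker's actual lists or code.

```python
import random

# Hypothetical word lists for illustration; not Helm's or Monicker's real lists.
ADJECTIVES = ["jaunty", "gauche", "jolly", "jumpy", "singing", "boasting"]
ANIMALS = ["ferret", "octopus", "bat", "clam", "aardvark", "otter"]


def friendly_name(rng):
    """Return an adjective-animal name such as 'jaunty-ferret'."""
    return f"{rng.choice(ADJECTIVES)}-{rng.choice(ANIMALS)}"


def unique_friendly_name(rng, taken):
    """Draw names until one is unused, then reserve it in `taken`."""
    if len(taken) >= len(ADJECTIVES) * len(ANIMALS):
        raise RuntimeError("all adjective-animal combinations are in use")
    while True:
        name = friendly_name(rng)
        if name not in taken:
            taken.add(name)
            return name


rng = random.Random()
taken = set()
print(unique_friendly_name(rng, taken))
```

With only 36 combinations this toy version runs out of names quickly; a real list would need to be much longer, which is the entropy-for-readability trade mentioned above.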

I've seen this used elsewhere; Docker does it for container names (but scientists and hackers instead of animals).  Netlify and GitHub will do it for project names.  LastPass has a "Pronounceable" option and pwgen walks a fine line; they explicitly trade a little entropy to avoid users "simply writ[ing] the password on a piece of paper taped to the monitor..." in the hell that is modern user/password management. I've also worked with a respected support organization that does this for customer issues (and all the humans seemed to be massively more effective IMing/emailing/Wiki-writing/Chatting in the hall about names instead of 10-digit numbers).

Aspen Mesh does this in a few places. The first benefit is some great GIFs. On our team, if Randy asks you to fix something in the "singing clams" object, he'll Slack you a GIF as well. The second benefit is distinctiveness - after you've seen a GIF of singing clams, the likelihood you accidentally delete the boasting aardvark object is basically nil. The likelihood that your dreams are haunted by singing clams is an entirely different concern.

So I argue that replacing numbers with pronounceable and memorable human-language identifiers is great when we need things to be distinguishable and possible to remember. Humans are too easily tricked by subtle changes in long numbers.

An added bonus that we enjoy is that we bring some of our most meaningful cluster names to life at Aspen Mesh. Our first development cluster, our first production cluster and our first customer cluster all have a special place in our hearts. Naturally, we took those cluster names and made them into Aspen Mesh mascots:

  • jaunty-ferret
  • gauche-octopus
  • jolly-bat

Our cluster names make it easier for us to get development work done, and come with the added bonus of making the office more fun. If you want a set of these awesome cluster animals, leave a comment or tweet us @AspenMesh and we’ll send you a sticker pack. 


Why You Want Idempotency Anyway

We've been talking about how you can use a service mesh to do progressive delivery lately.  Progressive delivery fundamentally is about decoupling software delivery from user activation of said software.  Once decoupled, the user activation portion is under business control. It's early days, but the promise here is that software engineering can build new stuff as fast as they can, and your customer success team (already keeping a finger on the pulse of the users and market) can independently choose when to introduce what new features (and associated risk) to whom.

We've demonstrated how you can use Flagger (a progressive delivery Kubernetes operator) to command a service mesh to do canary deploys.  These help you activate functionality for only a subset of traffic and then progressively increase that subset as long as the new functionality is healthy.  We're also working on an enhancement to Flagger to do traffic mirroring. I think this is really cool because it lets you try out a feature activation without actually exposing any users to the impact.  It's a "pre-stage" to a canary deployment: Send one copy to the original service, one copy to the canary, and check if the canary responds as well as the original.
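To make the mirroring idea concrete, here is a small Python sketch. It is not how Flagger or any mesh implements mirroring (a real mesh does this at the proxy layer); `mirror`, `primary`, `canary` and `mismatches` are hypothetical names, and comparing responses directly is a simplification chosen for the example.

```python
def mirror(request, primary, canary, mismatches):
    """Send `request` to both services; only the primary's answer is returned."""
    primary_response = primary(request)
    try:
        canary_response = canary(request)
    except Exception as err:  # a broken canary must never affect the user
        mismatches.append((request, f"canary raised {err!r}"))
    else:
        if canary_response != primary_response:
            mismatches.append((request, canary_response))
    return primary_response


# Hypothetical services: the canary has a bug for one input.
def primary(request):
    return request.upper()


def canary(request):
    return "oops" if request == "b" else request.upper()


mismatches = []
responses = [mirror(r, primary, canary, mismatches) for r in ["a", "b", "c"]]
print(responses)   # users only ever see the primary's responses
print(mismatches)  # evidence about the canary, gathered without user impact
```

Note that both services process every request, which is exactly why the request has to be safe to duplicate.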

There's a caveat we bring up when we talk about this, however: idempotency.  You can only do traffic mirroring if you're OK with duplicating requests, sending one to the primary and one to the canary.  If your app and infrastructure are OK with these duplicated requests, the requests are said to be idempotent.

Idempotency

Idempotency is the ability to apply an operation more than once and not change the result.  In math, we'd say that:

f(f(a)) = f(a)

An example of a mathematical function that's idempotent is ABS(), the absolute value.

ABS(-3.7) = 3.7
ABS(ABS(-3.7)) = ABS(3.7) = 3.7

We can repeat taking the absolute value of something as many times as we want; it won't change any more after the first time.  Similar things in your math toolbox: CEIL(), FLOOR(), ROUND(). But SQRT() is not in general idempotent: SQRT(16) = 4, SQRT(SQRT(16)) = 2, and so on.
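The definition can be checked directly. This is a small Python sketch using the standard-library equivalents of the functions above; `is_idempotent_at` is a helper name invented for the example, and it only tests a single input rather than proving anything in general.

```python
import math


def is_idempotent_at(f, a):
    """Check f(f(a)) == f(a) for one particular input a."""
    return f(f(a)) == f(a)


# abs, ceil, floor and round settle after the first application.
for f in (abs, math.ceil, math.floor, round):
    assert is_idempotent_at(f, -3.7)

# sqrt keeps changing the value: sqrt(16) is 4.0 but sqrt(4.0) is 2.0.
assert not is_idempotent_at(math.sqrt, 16)

# It does hold at the fixed points 0 and 1, hence "not in general".
assert is_idempotent_at(math.sqrt, 1)
```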

For web apps, idempotent requests are those that can be processed twice without causing something invalid to happen.  So read-only operations like HTTP GETs are always idempotent (as long as they're actually only reads). Some kinds of writes are also idempotent: suppose I have a request that says "Set andrews_timezone to MDT".  If that request gets processed twice, my timezone gets set to MDT twice. That's OK. The first one might have changed it from PDT to MDT, and the second one "changes" it from MDT to MDT, so no change. But in the end, my timezone is MDT and so I'm good.

An example of a not-idempotent request is one that says "Deduct $100 from andrews_account".  If we apply that request twice, then my account will actually have $200 deducted from it and I don't want that.  You remember those e-commerce order pages that say "Don't click reload or you may be billed twice"? They need some idempotency!
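The two requests can be compared side by side in a few lines of Python. The starting balance and the dictionary layout are assumptions made up for the example.

```python
def set_timezone(state, tz):
    """Idempotent: the final state is the same after one call or many."""
    state["andrews_timezone"] = tz


def deduct(state, amount):
    """Not idempotent: every repeated call changes the balance again."""
    state["andrews_account"] -= amount


# Starting values are made up for the example.
state = {"andrews_timezone": "PDT", "andrews_account": 500}

for _ in range(2):  # the same pair of requests delivered twice
    set_timezone(state, "MDT")
    deduct(state, 100)

print(state)  # {'andrews_timezone': 'MDT', 'andrews_account': 300}
```

The timezone ends up where one delivery would have left it; the account is out $200 instead of $100.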

This is important for traffic mirroring because we're going to duplicate the request and send one copy to the primary and one to the canary.  While idempotency is great for enabling this traffic mirroring case, I'm here to tell you why it's a great thing to have anyway, even if you're never going to do progressive delivery.

Exactly-Once Delivery and Invading Generals

There's a fundamental tension that emerges if you have distributed systems that communicate over an unreliable channel.  You can never be sure that the other side received your message exactly once.  I'll retell the parable of the Invading Generals as it was told to me the first time I was sure I had designed a system that solved this paradox.

There is a very large invading army that has camped in a valley for the night.  The defenders are split into two smaller armies that surround the invading army; one set of defenders on the eastern edge of the valley and one on the west.  If both defending armies attack simultaneously, they will defeat the invaders. However, if only one attacks, it will be too small; the invaders will defeat it, and then turn to the remaining defenders at their leisure and defeat them as well.

The general for the eastern defenders needs to coordinate with the general for the western defenders for simultaneous attack.  Both defenders can send spies with messages through the valley, as many as they want. Each spy has a 10% chance of being caught by the invaders and killed before his message is delivered, and a 90% chance of successfully delivering the message.  You are the general for the eastern defenders: what message will you send to the western defenders to guarantee you both attack simultaneously and defeat the invaders?

Turns out, there is no guaranteed safe approach.  Let's go through it. First, let's send this message:

"Western General, I will attack at dawn and you must do the same."

- Eastern General

There's a 90% chance that your spy will get through, but there's a 10% chance that he won't; then only you will attack, and you will lose to the invaders.  The problem statement said you have as many spies as you want, so we must be able to do better!

OK, let's send lots of spies with the same message.  Then our probability of success is 1-0.1^n, where n is the number of spies.  So we can asymptotically approach 100% probability that the other side agrees, but we can never be sure.
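The formula is easy to tabulate. This Python sketch uses exact fractions rather than floats, because floating-point arithmetic would round the answer to exactly 1 for large n and hide the point; the function name is invented for the example.

```python
from fractions import Fraction


def p_message_delivered(n, p_caught=Fraction(1, 10)):
    """Probability that at least one of n independent spies gets through."""
    return 1 - p_caught ** n


for n in (1, 2, 3, 10):
    print(n, p_message_delivered(n))

# However many spies we send, the probability stays strictly below 1.
assert all(p_message_delivered(n) < 1 for n in range(1, 100))
```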

How about this message:

"Western General, I am prepared to attack at dawn.  Send me a spy confirming that you have received this message so I know you will also attack.  If I don't receive confirmation, I won't attack because my army will be defeated if I attack alone."

- Eastern General

Now, if you don't receive a spy back from the western general you'll send another, and another, until you get a response.  But.... put yourself in the shoes of the western general. How does the western general know that you'll receive the confirmation spy?  Should the western army attack at dawn? What if the confirmation spy was caught and now only the western army attacks, ensuring defeat?

The western general could send lots of confirmation spies, so there is a high probability that at least one gets through.  But they can't guarantee with 100% probability that one gets through.

The western general could also send this response:

"Eastern General, we have received your spy.  We are also prepared to attack at dawn. We will be defeated if you do not also attack, and I know you won't attack if you don't know that we have received your message.  Please send back a spy confirming that you have received my confirmation or else we will not attack because we will be destroyed."

- Western General

A confirmation of a confirmation! (In networking ARQ terms, an ACK-of-an-ACK).  Again, this can reduce the probability of failure but cannot provide guarantees: we can keep shifting uncertainty between the Eastern and Western generals but never eliminate it.

Engineering Approaches

Okay, we can't know for sure that our message is delivered exactly once (regardless of service mesh or progressive delivery or any of that), so what are we going to do?  There are a few approaches:

• Retry naturally-idempotent requests

• Uniqueify requests

• Conditional updates

• Others

Retry Naturally-Idempotent Requests

If you have a request that is naturally idempotent, like getting the temperature on a thermostat, the end user can just repeat it if they didn't get the response they want.
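In code, that "just repeat it" policy is a plain retry loop. The sketch below is in Python; `retry` and the simulated thermostat read are hypothetical, and a production client would usually add timeouts and backoff between attempts.

```python
def retry(operation, attempts=5):
    """Call an idempotent operation until it answers or attempts run out."""
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    last_error = None
    for _ in range(attempts):
        try:
            return operation()
        except TimeoutError as err:  # request or response was lost; ask again
            last_error = err
    raise last_error


# Simulated thermostat read that loses the first two requests.
calls = {"count": 0}


def get_temperature():
    calls["count"] += 1
    if calls["count"] < 3:
        raise TimeoutError("no response from thermostat")
    return 21.5


print(retry(get_temperature))  # 21.5, obtained on the third attempt
```

This is only safe because reading the temperature has no side effects; the same loop wrapped around the $100 deduction would be a bug.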

Uniqueify Requests

Another approach is to make requests unique at the client, and then have all the other services avoid processing the same unique request twice.  One way to do this is to invent a UUID at the client and then have servers remember all the UUIDs they've already seen. My deduction request would then look like:

This is unique request f41182d1-f4b2-49ec-83cc-f5a8a06882aa.
If you haven't seen this request before, deduct $100 from andrews_account.

Then you can submit this request as many times as you want to the processor, and the processor can check if it's handled "f41182d1-f4b2-49ec-83cc-f5a8a06882aa" before.  There are a few caveats here.

First, you have to have a way to generate unique identifiers.  UUIDs are pretty good, but theoretically there's an extremely small possibility of UUID collision; practically, there are a couple of minor foot-guns to watch out for, like generating UUIDs on two VMs or containers that both have fake virtual MAC addresses that match.  You can also have the server make the unique identifier for you (it could be an auto-generated primary key in a database that is guaranteed to be unique).

Second, your server has to remember all the UUIDs that you have processed.  Typically you put these in a database (maybe using UUID as a primary key anyway).  If the record of processed UUIDs is stored separately from the action you take when processing, there's still a "risk window": you might commit a UUID and then fail to process it, or you might process it and fail to commit the UUID.  Algorithms like two-phase commit and Paxos can help close the risk window.
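Here is a minimal in-memory Python sketch of the uniqueify approach. The `Account` class and the starting balance are made up for the example. It sidesteps the risk window only because the balance and the set of seen IDs live in the same process and are updated together; a real service would need a transaction or one of the algorithms above to get the same effect.

```python
import uuid


class Account:
    def __init__(self, balance):
        self.balance = balance
        self.seen_requests = set()  # IDs of requests already applied

    def deduct(self, request_id, amount):
        """Apply the deduction once per request_id; return True if applied."""
        if request_id in self.seen_requests:
            return False  # duplicate delivery, nothing to do
        self.balance -= amount
        self.seen_requests.add(request_id)
        return True


account = Account(balance=500)

# The client invents the ID once and reuses it on every retry.
# uuid4() is random, so it does not depend on MAC addresses.
request_id = str(uuid.uuid4())

print(account.deduct(request_id, 100))  # True: first delivery is applied
print(account.deduct(request_id, 100))  # False: duplicate is ignored
print(account.balance)                  # 400
```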

Conditional Updates

Another approach is to include information in the request about what things looked like when the client sent the request, so that the server can abort the request if something has changed.  This includes the case that the "change" is a duplicate request and we've already processed it.

For instance, maybe my bank ledger looks like this:

Then I would make my request look like:

As long as the last transaction in andrews_account is number 563,
Create entry 564: Deduct $100 from andrews_account

If this request gets duplicated, the first will succeed and the second will fail.  After the first:

The duplicated request will fail:

As long as the last transaction in andrews_account is number 563,
Create entry 564: Deduct $100 from andrews_account

In this case the server could respond to the first copy with "Success" and the second copy with a soft failure like "Already committed" or just tell the client to read and notice that its update already happened.  MongoDB, AWS DynamoDB and others support these kinds of conditional updates.
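The same check can be sketched in Python. The `Ledger` class, its starting balance and its return strings are assumptions for illustration; they are not the API of MongoDB or any other database.

```python
class Ledger:
    def __init__(self, last_entry, balance):
        self.last_entry = last_entry  # number of the most recent transaction
        self.balance = balance

    def deduct_if_last_entry_is(self, expected_last, amount):
        """Create the next entry only if the ledger is where the client saw it."""
        if self.last_entry != expected_last:
            return "Already committed"  # soft failure: state has moved on
        self.last_entry += 1
        self.balance -= amount
        return "Success"


ledger = Ledger(last_entry=563, balance=500)

# The same request delivered twice.
print(ledger.deduct_if_last_entry_is(563, 100))  # Success
print(ledger.deduct_if_last_entry_is(563, 100))  # Already committed
print(ledger.last_entry, ledger.balance)         # 564 400
```

The soft failure is also returned if some unrelated transaction got in first, so a client that sees it should re-read the ledger before deciding whether to resubmit.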

Others

There are many practical approaches to this problem.  I recommend doing some initial reasoning about idempotency, and then trying to shift as much functionality as you can to the database or persistent state layer you're using.  While I gave a quick tour of some of the things involved in idempotency, there are a lot of other tricks like write-ahead journalling, conflict-free replicated data types and others that can enhance reliability.

Conclusion

Traffic mirroring is a great way to exercise canaries in a production environment before exposing them to your users.  Mirroring makes a duplicate of each request and sends one copy to the primary, one copy to the new canary version of your microservice.  This means that you must use mirroring only for idempotent requests: requests that can be applied twice without causing something erroneous to happen.

This caveat probably exists even if you aren't doing traffic mirroring, because networks fail.  The Eastern General and Western General can never really be sure their messages are delivered exactly once; there will always be a case where they may have to retry.  I think you want to build idempotency wherever possible, and then you should use traffic mirroring to test your canary deployments.


Advancing the promise of service mesh: Why I work at Aspen Mesh

The themes and content expressed are mine alone, with helpful insight and thoughts from my colleagues, and are about software development in a business setting.

I’ve been working at Aspen Mesh for a little over a month and during that time numerous people have asked me why I chose to work here, given the opportunities in Boulder and the Front Range.

To answer that question, I need to talk a bit about my background. I’ve been a professional software developer for about 13 years now. During that time I’ve primarily worked on the back-end for distributed systems and have seen numerous approaches to the same problems with various pros and cons. When I take a step back, though, a lot of the major issues that I’ve seen are related to deployment and configuration around service communication:
How do I add a new service to an already existing system? How big, in scope, should these services be? Do I use a message broker? How do I handle discoverability, high availability and fault tolerance? How, and in what format, should data be exchanged between services? How do I audit the system when the system inevitably comes under scrutiny?

I’ve seen many different approaches to these problems. In fact, there are so many approaches, some orthogonal and some similar, that software developers can easily get lost. While the themes are constant, it is time consuming for developers to get up to speed with all of these technologies. There isn’t a single solution that solves every common problem seen in the backend; I’m sure the same applies to the front-end as well. It’s hard to truly understand the pros and cons of an approach until you have a working system; and when that happens and if you then realize that the cons outweigh the pros, it may be difficult and costly to get back to where you started (see sunk cost fallacy and opportunity cost). Conversely, analysis paralysis is also costly to an organization, both in terms of capital—software developers are not cheap—and an inability to quickly adapt to market pressures, be it customer needs and requirements or a competitor that is disrupting the market.

Yet the hype cycle continues. There is always a new shiny thing taking the software world by storm. You see it in discussions on languages, frameworks, databases, messaging protocols, architectures ad infinitum. Separating the wheat from the chaff is something developers must do to ensure they are able to meet their obligations. But with the signal-to-noise ratio being low at times and with looming deadlines, not all possibilities can be explored.

So as software developers, we have an obligation of due diligence and to be able to deliver software that provides customer value; that helps customers get their work done and doesn’t impede them, but enables them. Most customers don’t care about which languages you use or which databases you use or how you build your software or what software process methodology you adhere to, if any. They just want the software you provide to enable them to do their work. In fact, that sentiment is so strong that slogans have been made around it.

So what do customers care about, generally speaking? They care about access to their data, how they can view it and modify it and draw value from it. It should look and feel modern, but even that isn’t a strict requirement. It should be simple to use for a novice, but yet provide enough advanced capability to help your most advanced users make you learn something new about the tool you’ve created. This is information technology after all. Technology for technology’s sake is not a useful outcome.

Any work that detracts from adding customer value needs to be deprioritized, as there is always more work to do than hours in the day. As developers, it’s our job to be knee deep in the weeds so it’s easy to lose sight of that; unit testing, automation, language choice, cloud provider, software process methodology, etc… absolutely matter, but they are a means to an end.

With that in mind, let’s create a goal: application developers should be application developers.

Not DevOps engineers, or SREs or CSRs, or any of the myriad other roles they are often asked to take on. I’ve seen my peers happiest when they are solving difficult problems and challenging themselves. Not when they are figuring out what magic configuration setting is breaking the platform. Command over their domain and the ability and permission to “fix it” is important to almost every appdev.

If developers are expensive to hire, train, replace and keep then they need to be enabled to do their job to the best of their ability. If a distributed, microservices platform has led your team to solving issues in the fashion of Sherlock Holmes solving his latest mystery, then perhaps you need a different approach.

Enter Istio and Aspen Mesh

It’s hard to know where the industry is with respect to the Hype Cycle for technologies like microservices, container orchestration, service mesh and a myriad of other choices; this isn’t an exact science where we can empirically take measurements. Most companies have older, but proven, systems built on LAMP or Java application servers or monoliths or applications that run on a big iron system. Those aren’t going away anytime soon, and developers will need to continue to support and add new features and capabilities to these applications.

Any new technology must provide a path for people to migrate their existing systems to something new.

If you have decided to or are moving towards a microservice architecture, even if you have a monolith, implementing a service mesh should be among the possibilities explored. If you already have a microservice architecture that leverages gRPC or HTTP, and you're using Kubernetes, then the benefits of a service mesh can be quickly realized. It's easy to sign up for our beta and install Aspen Mesh and the sample bookinfo application to see things in action. Once I did, I became a true believer. Not being coupled with a particular cloud provider, but being flexible and able to choose where and how things are deployed empowers developers and companies to make their own choices.

Over the past month I’ve been able to quickly write application code and get it delivered faster than ever before; that is in large part due to the platform my colleagues have built on top of Kubernetes and Istio. I’ve been impressed by how easy a well built cloud-native architecture can make things, and learning more about where Aspen Mesh, Istio and Kubernetes are heading gives me confidence that community and adoption will continue to grow.

As someone that has dealt with distributed systems issues continuously throughout his career, I know managing and troubleshooting a distributed system can be exhausting. I just want to enable others, even Aspen Mesh as we dogfood our own software, to do their jobs. To enable developers to add value and solve difficult problems. To enable a company to monitor their systems, whether it be mission critical or a simple CRUD application, to help ensure high uptime and responsiveness. To enable systems to be easily auditable when the compliance personnel have GDPR, PCI DSS or HIPAA concerns. To enable developers to quickly diagnose issues within their own system, fix them and monitor the change. To enable developers to understand how their services are communicating with each other--if it’s an n-tier system or a spider’s web--and how requests propagate through their system.

The value of Istio and the benefits of Aspen Mesh in solving these challenges are what drew me here. The opportunities are abundant and fruitful. I get to program in Go, in a SaaS environment and on a small team with a solid architecture. I am looking forward to becoming a part of the larger CNCF community. With microservices and cloud computing no longer being niche--which I’d argue hasn’t been the case for years--and with businesses adopting these new technology patterns quickly, I feel as if I made the right long-term career choice.


Top 3 Service Mesh Developments in 2019

Last year was about service mesh evaluation, trialing — and even hype.

While the interest in service mesh as a technology pattern was very high, it was mostly about evaluation and did not see widespread adoption. The capabilities service mesh can add to ease managing microservice-based applications at runtime are obvious, but the technology still needs to reach maturity before gaining widespread production adoption.

What we can say is service mesh adoption should evolve from the hype stage in a very real way this year.

What can we expect to see in 2019?

  1. The evolution and coalescing of service mesh as a technology pattern;
  2. The evolution of Istio as the way enterprises choose to implement service mesh;
  3. Clear use cases that lead to wider adoption.

The Evolution of Service Mesh

There are several architectural options when it comes to service mesh, but undoubtedly, the sidecar architecture will see the most widespread usage in 2019. Sidecar proxy as the architectural pattern, and more specifically, Envoy as the technology, have emerged as clear winners for how the majority will implement service mesh.

Considering control plane service meshes, we have seen the space coalesce around leveraging sidecar proxies. Linkerd, with its merging of Conduit and release of Linkerd 2, got on the sidecar train. And the original sidecar control plane mesh, Istio, certainly has the most momentum in the cloud native space. A look at the Istio GitHub repo shows:

  • 14,500 stars;
  • 6,400 commits;
  • 300 contributors.

And if these numbers don’t clearly demonstrate the momentum of the project, just consider the number of companies building around Istio:

  • Aspen Mesh;
  • Avi Networks;
  • Cisco;
  • OpenShift;
  • NGINX;
  • Rancher;
  • Tufin Orca;
  • Tigera;
  • Twistlock;
  • VMware.

The Evolution of Istio

So the big question is where is the Istio project headed in 2019? I should start with the disclaimer that the following are all guesses. They are well-informed guesses, but guesses nonetheless.

Community Growth

Now that Istio has hit 1.0, the number of contributors outside the core Google and IBM team is starting to grow. I’d hazard the guess that Istio will be truly stable around 1.3 sometime in June or July. Once the project gets to the point it is usable at scale in production, I think you’ll really see it take off.

Emerging Vendor Landscape

At Aspen Mesh, we placed our bet on Istio 18 months ago. It seems to be becoming clear that Istio will win service mesh in much the same way Kubernetes has won container orchestration.

Istio is a powerful toolbox that directly addresses many microservices challenges that are being solved with multiple manual processes, or are not being solved at all. The power of the open source community surrounding it also seems to be a factor that will lead to widespread adoption. As this becomes clearer, the number of companies building on Istio and building Istio integrations will increase.

Istio Will Join the Cloud Native Computing Foundation

Total guess here, but I’d bet on this happening in 2019. CNCF has proven to be an effective steward of cloud-native open source projects. I think this will also be a key to widespread adoption which will be key to the long-term success of Istio. We shall see what the project founders decide, but this move will benefit everyone once the Istio project is at the point it makes sense for it to become a CNCF project.

Real-World Use Cases Are Key To Spreading Adoption

Service mesh is still a nascent market and in the next 12-24 months, we should see the market expand past just the early adopters. But for those who have been paying attention, the why of a service mesh has largely been answered. The why is also certain to evolve, but for now, the reasons to implement a service mesh are clear. I think that large parts of the how are falling into place, but more will emerge as service mesh encounters real-world use cases in 2019.

I think what remains unanswered is “what are the real world benefits I am going to see when I put this into practice?” This is not a new question around an emerging technology. Neither will the way this question gets answered be anything new: it will be through use cases. I can’t emphasize enough how use cases based on actual users will be key.

Service mesh is a powerful toolbox, but only a small swath of users will care about how cool the tech is. The rest will want to know what problems it solves.

I predict 2019 will be the year of service mesh use cases that will naturally emerge as the number of adopters increases and begins to talk about the value they are getting with a service mesh.

Some Final Thoughts

If you are already using a service mesh, you understand the value it brings. If you’re considering a service mesh, pay close attention to this space; the growing number of use cases will make the real world value proposition clearer. And if you’re not yet decided on whether or not you need a service mesh, check out the recent Gartner, 451 and IDC reports on microservices, all of which say a service mesh will be mandatory by 2020 for any organization running microservices in production.


How The Service Mesh Space Is Like Preschool

I have a four year old son who recently started attending full day preschool. It has been fascinating to watch his interests shift from playing with stuffed animals and pushing a corn popper to playing with his science set (w00t for the STEM lab!) and riding his bike. The other kids in school are definitely informing his view of what cool new toys he needs. Undoubtedly, he could still make do with the popper and stuffed animals (he may sleep with Lambie until he's ten), but as he progresses his desire to explore new things increases.

Watching the community around service mesh develop is similar to watching my son's experience in preschool (if you're willing to make the stretch with me). People have come together in a new space to learn about cool new things, and as excited as they are, they don't completely understand the cool new things. Just as in preschool, there are a ton of bright minds that are eager to soak up new knowledge and figure out how to put it to good use.

Another parallel between my son and many of the people we talk to in the service mesh space is that they both have a long and broad list of questions. In the case of my son, it's awesome because they're questions like: "Is there a G in my name?" "What comes after Sunday?" "Does God live in the sky with the unicorns?" The questions we get from prospects and clients on service mesh are a bit different but equally interesting. It would take more time than anybody wants to spend to cover all these questions, but I thought it might be interesting to cover the top 3 questions we get from users evaluating service mesh.

What do I get with a service mesh?

We like getting this question because the answer to it is a good one. You get a toolbox that gives you a myriad of different capabilities. At a high level, what you get is observability, control and security of your microservice architecture. The features that a service mesh provides include:

  • Load balancing
  • Service discovery
  • Ingress and egress control
  • Distributed tracing
  • Metrics collection and visualization
  • Policy and configuration enforcement
  • Traffic routing
  • Security through mTLS

When do I need a service mesh?

You don't need 1,000 microservices for a service mesh to make sense. If you have nicknames for your monoliths, you're probably a ways away from needing a service mesh. And you probably don't need one if you only have 2 services, but if you have a few services and plan to continue down the microservices path it is easier to get started sooner. We are believers that containers and Kubernetes will be the way companies build infrastructure in the future, and waiting to hop on that train will only be a competitive disadvantage. Generally, we find that the answer to this question usually hinges on whether or not you are committed to cloud native. Service meshes like Aspen Mesh work seamlessly with cloud native tools so the barrier to entry is low, and running cloud native applications will be much easier with the help of a service mesh.

What existing tools does service mesh allow me to replace?

This answer all depends on what functionality you want. Here's a look at tools that service mesh overlaps, what it provides and what you'll need to keep old tools for.

API gateway
Not yet. It replaces some of the functionality of an API gateway but does not yet cover all of the ingress and payment features an API gateway provides. Chances are API gateways and service meshes will converge in the future.

Tracing Tools
You get tracing capabilities as part of Istio. If you are using distributed tracing tools such as Jaeger or Zipkin, you no longer need to manage them separately, as they are part of the Istio toolbox. With Aspen Mesh's hosted SaaS platform, we offer managed Jaeger so you don't even need to deploy or manage it.

Metrics Tools
Just like tracing, a metrics monitoring tool is included as part of Istio. With Aspen Mesh's hosted SaaS platform, we offer managed Prometheus and Grafana so you don't even need to deploy or manage them. Istio leverages Prometheus to query metrics. You have the option of visualizing them through the Prometheus UI, or using Grafana dashboards.

Load Balancing
Yep. Envoy is the sidecar proxy used by Istio and provides load balancing functionality such as automatic retries, circuit breaking, global rate limiting, request shadowing and zone local load balancing. You can use a service mesh in place of tools like HAProxy or NGINX for ingress load balancing.

Security tools
Istio provides mTLS capabilities that address some important microservices security concerns. If you’re using SPIRE, you can definitely replace it with Istio which provides a more comprehensive utilisation of the SPIFFE framework. An important thing to note is that while a service mesh adds several important security features, it is not the end-all-be-all for microservices security. It’s important to also consider a strategy around network security.

If you have little ones and would be interested in comparing notes on the fantastic questions they ask, let’s chat. I'd also love to talk anything service mesh. We have been helping a broad range of customers get started with Aspen Mesh and make the most out of it for their use case. We’d be happy to talk about any of those experiences and best practices to help you get started on your service mesh journey. Leave a comment here or hit me up @zjory.



From Middleware to Containers: Infrastructure is Finally Cool

As someone fresh out of school just starting my software engineering career, I want to solve interesting problems. Who doesn’t? A computer science degree gave me the opportunity to see a spectrum of different engineering opportunities, which led me to decide that working on infrastructure would be the most impactful area, and with the rise of cloud native technologies, actually a compelling space to work in. There is a difference between developing new functionality and developing to solve existing problems. More often than not, the solutions that address existing challenges in an industry are the ones that are used the most and last the longest. This is what excites me about working on infrastructure: the ability to build something that millions of people will rely on to run their applications. On the surface it doesn’t appear to be the most exciting work, but you can be sure that your time and effort are being put to good use.

You want to see your contributions make an impact somehow, whether that’s writing webapps, iPhone applications, business tools, etc. - the things that people actually use day-to-day. Infrastructure may not be as visible or as tangible as these kinds of technologies, but it’s gratifying to know that it’s the underlying piece that makes it all work. As much as I want to be able to say that I contribute to something that all of my non-tech friends can easily understand (like the front-end of Netflix), I think it’s even more interesting to make them think about the things that happen behind the scenes. We all expect our favorite apps, websites, etc. to be able to respond quickly to our requests no matter how many people are using them at the same time, but on the backend this is not something that is easy to handle and properly test for. What about security? We also expect that when we are trusting software with our information that it isn’t being easily intercepted or leaked along the way. Scalability and security are just two of many kinds of problems that software infrastructure incorporates, and in the end we are relying on them to actually make the front-end software usable. The advantage these days is that infrastructure software has become an incredibly interesting space to be in. Tools like Docker, Kubernetes and Istio are fascinating technologies with vibrant communities around them.

One of the cool, heavily used Kubernetes-related projects that I’m a fan of is Envoy. I can’t help but think about how some version of Envoy is being used every time I order a Lyft to make sure I actually get a ride. Infrastructure doesn’t seem as intriguing at first because, as important as it is, it’s running in the background and easily forgotten. Everyone needs it, but in the end, who wants to build it? The answer to that question is definitely changing as the infrastructure landscape evolves. Kubernetes, the OS of the cloud, has become a project that everyone wants a hand in. You don’t hear about people itching to make contributions to the Linux kernel, but you hear about Kubernetes and containers everywhere.

Coming up with solutions to solve the problems that we’re running into today has become more attractive to junior developers especially. We’re watching as more and more people are using technology every day, and like I mentioned before, we want our contributions to be impactful. How are we going to handle all of this traffic in a smooth and scalable way? Enter: distributed systems. Microservices are critical to constructing applications that can handle huge transaction volumes at scale. Enterprise applications run by companies like Lyft, Twitter and Google would fall apart with even normal rates of traffic without their distributed architectures. Working on these infrastructural pieces is challenging, and provides the impact that we, junior developers, are looking for.

Another thing that makes this work enticing to junior developers is that it involves an open source community. The way that the tech community has decided to solve some of these bigger, infrastructure-related problems has largely been through open source, which is both intimidating and inviting to those who are new to the tech industry. There is an open group of people talking about the technology and a community willing to help, but at the same time it’s daunting to contribute to these bigger open source projects when you’re just starting out. I will say, however, that the benefits of being able to leverage so many technologies and the community support make it a lot of fun to be a part of.

To recap, here are some of my favorite things about working on infrastructure:

  • We can solve some really hard problems with good infrastructure!
  • If it’s done right, you can build something that can be easily customized to solve problems of various sizes and for all kinds of use cases.
  • All of the cool things and services we consume daily rely on it. Talk about actually seeing your hard work being put to good use!
  • Whether you’re doing proprietary work or not, you are being introduced to open source and the community that comes with it.

I’ll admit, developing infrastructure, despite all of the interesting bits, is still not the most glamorous work. It’s the underlying technology that most people take for granted in their everyday use of technology, and is often less shiny than a beautifully designed UI and other components that sit on top of it. But once you dig in, it’s exciting to see what an impact you can make, and cloud-native technologies and communities make it a fun space to work in. What I will say though is that it’s a great way to start out your career in tech, and it’s a fun, challenging, and very rewarding place to be.


Distributed Tracing, Istio and Your Applications

In the microservices world, distributed tracing is slowly becoming the most important tool for debugging and understanding your application dependencies. During my recent conversations at meetups and conferences, I found there was a lot of interest in how distributed tracing works, but at the same time a fair amount of confusion about how tracing interacts with service meshes like Istio and Aspen Mesh. In particular, I was frequently asked the following questions:

  • How does tracing work with Istio? What information is collected and reported in the spans?
  • Do I have to change my applications to benefit from distributed tracing in Istio?
  • If I am currently reporting spans in my application how will it interact with spans reported from Istio?

In this post I am going to try to answer these questions. Before we get deeper into them, some quick background on how I ended up writing tracing-related posts. If you follow the Aspen Mesh blog, you may have noticed I wrote two earlier posts related to tracing: one on tracing requests to AWS services when using Istio, and a second on tracing gRPC applications with Istio.

We have a pretty small engineering team at Aspen Mesh and, as it goes in most startups, if you work frequently on a sub-system or component you quickly become (or are labeled or assigned) the resident expert. I added tracing to our microservices and integrated it with Istio in the AWS environment, and in that process uncovered various interesting interactions which I thought might be worth sharing. Over the last few months we have been using tracing very heavily to gain an understanding of our microservices, and it has now become the first place we look when things break. With that, let's move on to answering the questions I mentioned above.

How does tracing work with Istio?

Istio injects a sidecar proxy (Envoy) in the Pod in which your application container is running. This sidecar proxy transparently intercepts (iptables magic) all network traffic going in and out of your application. Because of this interception the sidecar proxy is in a unique position to automatically trace all network requests (HTTP/1.1, HTTP/2.0 & gRPC).

Let's see what changes the sidecar proxy makes to an incoming request to a Pod from a client (external or another microservice). From this point on I'm going to assume tracing headers are in Zipkin format for simplicity.

  • If the incoming request doesn't have any tracing headers, the sidecar proxy will create a root span (span where trace, parent and span IDs are all the same) before passing the request to the application container in the same Pod.
  • If the incoming request has tracing information (which should be the case if you're using Istio ingress or your microservice is being called from another microservice with sidecar proxy injected), the sidecar proxy will extract the span context from these headers, create a new sibling span (same trace, span and parent ID as incoming headers) before passing the request to the application container in the same Pod.

In the reverse direction, when the application container is making outbound requests (to external services or services in the cluster), the sidecar proxy in the Pod performs the following actions before making the request to the upstream service:

  • If no tracing headers are present, the sidecar proxy creates a root span and injects the span context as tracing headers into the new request.
  • If tracing headers are present, the sidecar proxy extracts the span context from the headers and creates a child span from this context. The new context is propagated as tracing headers in the request to the upstream service.

Based on the above explanation you should note that for every hop in your microservice chain you will get two spans reported from Istio, one from the client sidecar (span.kind set to client) and one from the server sidecar (span.kind set to server). All the spans created by the sidecars are automatically reported by the sidecars to the configured tracing backend systems like Jaeger or Zipkin.

Next let's look at the information reported in the spans. The spans contain the following information:

  • x-request-id: Reported as guid:x-request-id, which is very useful for correlating access logs with spans.
  • upstream cluster: The upstream service to which the request is being made. If the span is tracking an incoming request to a Pod this is typically set to in.<name>. If the span is tracking an outbound request this is set to out.<name>.
  • HTTP headers: The following HTTP headers are reported when available:
    • URL
    • Method
    • User agent
    • Protocol
    • Request size
    • Response size
    • Response Flags
  • Start and end times for each span.
  • Tracing metadata: This includes the trace ID, span ID and the span kind (client or server). Apart from these the operation name is also reported for every span. The operation name is set to the configured virtual service (or route rule in v1alpha1) which affected the route or "default-route" if the default route was chosen. This is very useful in understanding which Istio route configuration is in effect for a span.

With that let's move on to the second question.

Do I have to change my application to gain benefit from tracing in Istio?

Yes, you will need to add logic in your application to propagate tracing headers from incoming to outgoing requests to gain full benefit from Istio's distributed tracing.

If the application container makes a new outbound request in the context of an incoming request and doesn't propagate the tracing headers from the incoming request, the sidecar proxy creates a root span for the outbound request. This means you will always see traces with only two microservices. On the other hand if the application container does propagate the tracing headers from incoming to outgoing requests, the sidecar proxy will create child spans as described above. Creation of the child spans gives you the ability to understand dependencies across multiple microservices.

There are a couple of options for propagating tracing headers in your application.

  1. Look for the tracing headers mentioned in the Istio docs and transfer them from incoming to outgoing requests. This method is simple and works in almost all cases. However, it has a major drawback: you cannot add custom tags, such as user information, to the spans, and you cannot create child spans for events in the application which you might want to report. Because you are simply transferring headers without understanding the span formats or contexts, there is limited ability to add application-specific information.
  2. The second method is to set up a tracing client in your application and use the OpenTracing APIs to propagate tracing headers from incoming to outgoing requests. I have created a sample tracing-go package which provides an easy way to set up jaeger-client-go in your applications so that it is compatible with Istio. The following snippet should be included in the main function of your application:
       import (
         "io"
         "log"

         "github.com/spf13/viper"

         "github.com/aspenmesh/tracing-go"
       )

       // setupTracing configures tracing and returns a Closer for the tracer,
       // or nil when tracing is disabled. Call it from main and, if the result
       // is non-nil, defer its Close so the tracer is shut down on exit.
       func setupTracing() io.Closer {
         // Configure Tracing
         tOpts := &tracing.Options{
           ZipkinURL:     viper.GetString("trace_zipkin_url"),
           JaegerURL:     viper.GetString("trace_jaeger_url"),
           LogTraceSpans: viper.GetBool("trace_log_spans"),
         }
         if err := tOpts.Validate(); err != nil {
           log.Fatal("Invalid options for tracing: ", err)
         }
         if !tOpts.TracingEnabled() {
           return nil
         }
         tracer, err := tracing.Configure("myapp", tOpts)
         if err != nil {
           // There is no tracer to close when configuration fails.
           log.Fatal("Failed to configure tracing: ", err)
         }
         return tracer
       }

The key point to note is that in the tracing-go package I have set the OpenTracing global tracer to the Jaeger tracer. This enables me to use the OpenTracing APIs for propagating headers from incoming to outgoing requests like this:

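   import (
     "net/http"

     ot "github.com/opentracing/opentracing-go"
     "golang.org/x/net/context/ctxhttp"
   )

   // injectTracingHeaders makes a GET request to addr on behalf of incomingReq,
   // carrying over the tracing context when the incoming request has a span.
   // The caller is responsible for closing the response body.
   func injectTracingHeaders(incomingReq *http.Request, addr string) (*http.Response, error) {
     ctx := incomingReq.Context()
     outgoingReq, err := http.NewRequest("GET", addr, nil)
     if err != nil {
       return nil, err
     }
     if span := ot.SpanFromContext(ctx); span != nil {
       // Write the span context into the outgoing request headers.
       if err := ot.GlobalTracer().Inject(
         span.Context(),
         ot.HTTPHeaders,
         ot.HTTPHeadersCarrier(outgoingReq.Header)); err != nil {
         return nil, err
       }
     }
     return ctxhttp.Do(ctx, nil, outgoingReq)
   }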

You can also use the OpenTracing APIs to set span tags or create child spans from the tracing context added by Istio like this:

   func SetSpanTag(incomingReq *http.Request, key string, value interface{}) {
     if span := ot.SpanFromContext(incomingReq.Context()); span != nil {
       span.SetTag(key, value)
     }
   }

Beyond these benefits, you don't have to deal with tracing headers directly; the tracer (in this case Jaeger) handles them for you. I strongly recommend this approach, as it lays the foundation in your application for adding enhanced tracing capabilities without much overhead.

Now let's move on to the third question.

How do spans reported by Istio interact with spans created by applications?

If you want the spans reported by your application to be child spans of the tracing context added by Istio, you should use the OpenTracing API StartSpanFromContext instead of StartSpan. StartSpanFromContext creates a child span from the parent context if one is present, and otherwise creates a root span.

Note that in all the examples above I have used the OpenTracing Go APIs, but you should be able to use any tracing client library written in the same language as your application, as long as it is compatible with the OpenTracing API.