Reducing MTTR: How I learned to stop worrying and love the service mesh

Operators and developers (like you) are the new kingmakers in modern organizations. That's a glorious feat, but it also means you're on the hook as the core value drivers in application-centric organizations. Devs need to spend their time writing business logic that adds new value, which means Ops must provide a stable, secure, and scalable platform that frees them from worrying about infrastructure. So how do you get to where you need to be as a DevOps team?


Let’s level set

Key terms for today: MTTR and observability. We're not talking about the old hardware MTTR here (mean time to repair). In the DevOps kingdom, it's mean time to resolution: simply put, how good are you at fixing problems. 'Nuff said.
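If you want the metric itself pinned down, MTTR is just the average time from when an incident is detected to when it's resolved. A minimal sketch in Python (the incident timestamps below are made up for illustration):

```python
from datetime import datetime, timedelta

# Hypothetical incident log: (detected, resolved) timestamp pairs.
incidents = [
    (datetime(2023, 5, 1, 9, 0),  datetime(2023, 5, 1, 9, 45)),
    (datetime(2023, 5, 3, 14, 0), datetime(2023, 5, 3, 16, 30)),
    (datetime(2023, 5, 7, 22, 0), datetime(2023, 5, 7, 22, 20)),
]

def mttr(incidents):
    """Mean time to resolution: average of (resolved - detected)."""
    total = sum((end - start for start, end in incidents), timedelta())
    return total / len(incidents)

print(mttr(incidents))  # prints 1:11:40 for the sample data above
```

Track this number before and after a tooling change and you have a concrete way to show improvement.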

Observability. To some it’s just a DevOps buzzword for monitoring, but we think of it as monitoring on steroids. More data. Different data (logs, tracing). The kind of stuff you get out of a service mesh. Plus, you get alerts. Even analytics. So you see what’s going on in your apps.

So why look at MTTR now?

Keeping MTTR as low as possible keeps customers happy with app performance and availability. Simple as that. It’s also a key metric for engineering efficiency. Which means you can do more with the time you’ve got. So how do you get there?

There are many products out there that let you configure thresholds/SLOs and alert on violations. This is the simple part, and let's be honest, most operators are already doing it. There are also products that get you to observability, though they can drown you in data along the way if you're working in a scaled-up container environment. But the really difficult part begins when that dreaded alert fires. What do you do next?
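To make the "simple part" concrete, a threshold/SLO alert might look something like the following sketch of a Prometheus alerting rule. This is a hypothetical example: the metric name assumes an Istio-flavored mesh that exports request-duration histograms, and the 250 ms p99 threshold is just a placeholder SLO.

```yaml
groups:
- name: latency-slo
  rules:
  - alert: P99LatencyAboveSLO
    # Fire when p99 request latency for any workload stays above 250 ms for 10 minutes.
    expr: |
      histogram_quantile(0.99,
        sum(rate(istio_request_duration_milliseconds_bucket[5m]))
        by (le, destination_workload)) > 250
    for: 10m
    labels:
      severity: warning
    annotations:
      summary: "p99 latency for {{ $labels.destination_workload }} is above the 250 ms SLO"
```

Getting this alert to fire is easy. Knowing what to do when it fires is the hard part, and that's where the rest of this post comes in.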

Story Time

Meet Jax. Jax is an application developer. He likes to write code for things that make life easier for someone.

If you were Jax, what would you prefer?

Option one:

  • Spending hours trying to find the right data in order to discover when or where the problem occurred
  • Narrowing down the results
  • Creatively thinking of a way this could have happened
  • Having meetings to talk through possible solutions
  • Implementing and testing solutions
  • Finding a fix
  • Hoping it doesn’t happen again

Or option two:

  • Firing up the right tools to pinpoint the error
  • Getting recommendations on how to fix the error
  • Fixing the error and going back to work on the shiny new feature

The right tools for the job

You’re scaling K8s to the moon and rolling out new apps and features like the rock stars you are. But you’re coming up against some challenges. Security for one, and stability. So many clusters, so many breaking changes, so little time to find them. You need a service mesh, but that’s a whole different rabbit hole. A service mesh will give you a whole new set of tools, and a lot of data for observability, but will they be the right tools?

If you want to improve your MTTR, you need a suite of tools on top of observability to help you. Here are some things to look for. You want tools that:

Show a Cluster-wide view

Give you a cluster-wide view of configuration, policies, and application status

Visualize errors

Visualize when errors occurred and add relevant performance and error context to help operators better understand the situation

Surface probable failures

Surface the most likely cause of failures so you can get to root cause faster

Fix config changes directly

Allow you to apply configuration fixes directly, without switching to another app

Report failures

Easily report the reason for failures and the steps taken to remediate

Suggest resolutions

Suggest resolutions for policy and configuration issues related to threshold violations
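As an example of what "apply config fixes directly" can look like in an Istio-based mesh, a remediation is often as small as adding a retry policy to a VirtualService. This is a hypothetical sketch; the service name `reviews` and the retry settings are placeholders:

```yaml
apiVersion: networking.istio.io/v1alpha3
kind: VirtualService
metadata:
  name: reviews
spec:
  hosts:
  - reviews
  http:
  - route:
    - destination:
        host: reviews
    # Retry transient upstream failures instead of surfacing them to users.
    retries:
      attempts: 3
      perTryTimeout: 2s
      retryOn: 5xx,connect-failure
```

Tools that let you make this kind of change from the same place you diagnosed the problem shave real minutes off your MTTR.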

How to talk to people about service mesh

Now you have a clearer idea of what the right service mesh can do to reduce MTTR — and give you a metric that shows its worth, and yours. So how do you go about talking to others about why it’s important for you to do your job better? Here’s our advice:


Talk about your customers first: can these tools provide them with a better user experience?


Point out that adopting tools that help you save time actually saves the company money


Talk about how tools can help your organization through challenging times

Measuring your success over time

No plan is complete without some sort of checklist, right? Here are things to watch to make sure what you’ve worked so hard to implement keeps working right over time. Look at all of these before and then again after your new tools are up and running:

How much time are you (or your team) spending on finding issues?

How successful are deployments to production?

Are your new tools helping to reduce downtime?

How much effort does it take to manage SLOs for your microservice architecture?

Are the new tools helping to increase efficiency across your team?

Want to learn more?

There’s one downside to being the kingmaker. All eyes are on you. We’re here to help you get the tools you need to win. Feel free to reach out to hello@aspenmesh.io or check out Service Mesh University to learn more about service mesh on your own.
