Overcome the limitations of the Kubernetes Horizontal Pod Autoscaler (HPA) to achieve reliable, dynamic autoscaling

Making HPA not suck and live up to its potential



This paper will cover: 

  • How Kubernetes Horizontal Pod Autoscaler (HPA) works 
  • Biggest obstacles users face today using HPA 
  • How custom metrics adapters are used to try to overcome limitations 
  • An approach that optimizes HPA so you can autoscale dynamically and reliably, without expensive overprovisioning 

Common scaling problem with Kubernetes:
A retailer with a large e-commerce site struggles with huge peaks in demand followed by big drops. These can be caused by a number of business events, such as a large sale, new product introduction, or times like weekends and evenings. Predicting these peaks and valleys is difficult, since they are usually business driven, not driven by technology.  

Kubernetes’ Horizontal Pod Autoscaler (HPA) isn’t handling these swings in traffic the way they hoped: it often fails to scale up the right number of replicas fast enough, causing performance degradation. They need a solution that can learn the patterns of their e-commerce site and have capacity ready for traffic spikes in real time. Their goal is a highly responsive, dynamic scaling solution that meets demand and gives them peace of mind that their application is always performant. 

About Kubernetes Horizontal Pod Autoscaler (HPA)

Kubernetes can autoscale your applications based on metric thresholds such as CPU and memory utilization using the Kubernetes Horizontal Pod Autoscaler (HPA) resource. HPA's goal is to ensure that your application can always handle current demand to meet your SLOs while optimizing the amount of resources it uses.  

Note that adding or removing replicas is known as horizontal scaling; adding or removing resources (e.g., CPU or memory) to existing pods is referred to as vertical scaling. 

In practice, most people find that the default HPA falls short of their needs. To understand why this is, we must first understand how the HPA works. 

How HPA works

HPA is implemented as an intermittent control loop. During each loop, the HPA controller queries the Resource Metrics API for pod metrics or a custom metrics API for custom metrics (more on this in a bit).  The HPA controller collects the metrics for all pods that match the HPA’s selector specification and takes the mean of those metrics to determine how many replicas there should be. 

The formula to determine the number of replicas that should be running is pretty simple:

dR = ceiling(cR * (cM / dM))

Where:
dR = desired replicas
cR = current replicas
cM = current metric value
dM = desired metric value
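
For example, if 3 replicas are currently running at an average of 90% CPU utilization against a 60% target (values chosen purely for illustration):

dR = ceiling(3 * (90 / 60)) = ceiling(4.5) = 5

so the HPA controller would set the desired replica count to 5.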

The number of replicas is then adjusted to the desired value. The HPA controller adds or removes replicas incrementally to avoid peaks or dips in performance, and to lessen the impact of fluctuating metrics. 

The HPA can also be configured to use a custom metrics API to provide custom metrics to the HPA controller. The HPA controller uses custom metrics as raw metrics, meaning no utilization is calculated.  A custom metrics API is useful for scaling on non-pod metrics, such as HTTP metrics. 
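
For illustration, here is a sketch of an autoscaling/v2 HPA metric targeting an HTTP metric exposed on an Ingress object; all names and values are hypothetical, and a custom metrics adapter is assumed to be serving the metric:

# under spec: in an autoscaling/v2 HorizontalPodAutoscaler
metrics:
- type: Object
  object:
    metric:
      name: requests_per_second      # hypothetical metric served by a custom metrics adapter
    describedObject:
      apiVersion: networking.k8s.io/v1
      kind: Ingress
      name: my-app-ingress           # hypothetical Ingress name
    target:
      type: Value
      value: "2000"                  # scale to keep total requests/sec at or below 2000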

More details on how HPA works can be found in the Kubernetes documentation. 

The Challenges with HPA

HPA is a powerful construct in Kubernetes and promises to help deliver on the value of cloud native applications by ensuring that there are always the right number of replicas of a pod to meet an application's SLOs and resource utilization requirements. However, HPA currently has several limitations, and in practice users aren't realizing the value they hoped for from it.

Pod-level metrics don’t reflect performance of different containers 

By default, metrics are retrieved per pod, not per container. Pods often have multiple containers, such as a logging sidecar alongside an API container (this is almost always the case with Istio). Since the HPA controller gathers pod-level metrics, those metrics can be skewed, because performance characteristics usually differ significantly between containers within the same pod. 

Note that there is experimental support for container-level metrics; this feature is currently in alpha status.
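
As a rough sketch of what that looks like, an autoscaling/v2 HPA can target a single container's utilization with the ContainerResource metric type; the container name below is hypothetical, and the feature must be enabled via the HPAContainerMetrics feature gate:

# under spec: in an autoscaling/v2 HorizontalPodAutoscaler
metrics:
- type: ContainerResource
  containerResource:
    name: cpu
    container: my-app        # scale on the application container, not the sidecar
    target:
      type: Utilization
      averageUtilization: 80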

Threshold-based metrics don’t fit many cases

Autoscaling relies on some threshold value of a metric being crossed. For example, you could define an HPA that scales out or in when a threshold of 80% CPU utilization is crossed:
kubectl autoscale deployment my-app --cpu-percent=80 --min=3 --max=6 

In this example, the deployment ‘my-app’ will be scaled to a maximum of 6 replicas when the CPU utilization exceeds 80%. On the surface, this seems like a good way to make sure that CPU utilization due to spikes in traffic is maintained at an acceptable level. 
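
The same policy can be expressed declaratively. Here is a minimal sketch of a roughly equivalent autoscaling/v2 manifest, using the same illustrative names and values as above:

apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: my-app
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: my-app
  minReplicas: 3
  maxReplicas: 6
  metrics:
  - type: Resource
    resource:
      name: cpu
      target:
        type: Utilization
        averageUtilization: 80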

However, due to the way that autoscaling works, this often doesn’t address performance requirements acceptably. When the 80% CPU utilization threshold is crossed, Kubernetes will add replicas based on the HPA policy defined, wait for a period of time (default is 15 seconds) and remeasure the metric. If the threshold is still exceeded, Kubernetes will add more replicas, remeasure the metric, and so on, until the metric drops below the threshold, or the maximum number of replicas is reached. 
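
The pacing of these incremental additions can be tuned with the behavior field in autoscaling/v2; the sketch below uses illustrative values:

# under spec: in an autoscaling/v2 HorizontalPodAutoscaler
behavior:
  scaleUp:
    stabilizationWindowSeconds: 0   # react immediately (the default for scale-up)
    policies:
    - type: Percent
      value: 100          # allow at most doubling the replica count
      periodSeconds: 15   # per 15-second window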

Often, increases in traffic that cause performance drops are sustained. This may be due to business events such as sales or promotions, onboarding new users, or a few other unanticipated causes. Due to the incremental additions of replicas by Kubernetes, by the time enough replicas are added to handle the sustained increase in load, it may be too late and significant degradation or outages have occurred. 

Autoscaling Algorithm is too simplistic to be reliable

The algorithm used to determine the desired number of replicas for a deployment is pretty simple and can lead to inaccurate calculations in many situations. Since the algorithm is a simple ratio of the current metric value to the desired metric value across target pods, there are several scenarios that can affect the calculation: 

  • Pods that are not in a Ready state (Initializing, Failed, etc.) 
  • Pods that are missing metrics for some reason 
  • Multiple metrics with incongruent units (utilization vs raw metrics for example) 

This often leads to more replicas than needed, or even worse, not enough replicas, which can cause an outage. 

Custom Metrics Adapters require Ops implementation and still fall short

Kubernetes supports custom metrics adapters to supply custom metrics to the HPA controller, extending HPA metrics beyond pod CPU and memory. These adapters must follow Kubernetes’ custom metrics API specification. By implementing a custom metrics adapter, you can define autoscaling rules for any arbitrary metric, such as HTTP throughput or custom metrics emitted by an application. A common example is the Prometheus adapter, which lets you write PromQL queries to fetch metrics from a Prometheus server. 
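
For illustration, assuming a Prometheus adapter is installed and exposes a per-pod metric named http_requests_per_second (a hypothetical metric name), an HPA could scale on it like this:

# under spec: in an autoscaling/v2 HorizontalPodAutoscaler
metrics:
- type: Pods
  pods:
    metric:
      name: http_requests_per_second   # served by the custom metrics adapter
    target:
      type: AverageValue
      averageValue: "100"              # target 100 requests/sec per pod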

Using custom metrics adapters is a powerful way to define autoscaling rules for metrics that are meaningful to meeting application SLOs. But they also introduce a lot of operational overhead. In order to capture meaningful metrics, you’ll probably need to build your own adapter. This requires your DevOps or Platform Engineering team to develop, test, certify, operate, patch, and then support a critical piece of infrastructure for applications. After all this additional work, the fact remains that your custom metrics adapters are still threshold-based, creating the same challenges as the built-in Kubernetes metrics adapter. 

Hard to Define Meaningful SLIs as Metrics

It can be difficult to define what metrics, either built-in pod metrics or custom metrics, truly indicate when your application needs to scale out or in. The solution is a complex combination of SLIs, container metrics, supporting infrastructure like network and storage, as well as external dependencies like database or queue connections. And unless your application has been running in production for a while, any starting point for these indicators is an educated guess at best. 

Most developers understand behavioral indicators that an application needs to be scaled, like “database queries become slow during certain times of the day”, or “my application uses more memory than I expected when the number of requests grows.” But they probably can’t translate these behaviors into actionable metrics and indicators that can be encoded into YAML. Often, developers take a blanket approach to ensure they maintain their SLOs. They define very conservative thresholds in the hope that they are covered no matter what may happen. This inevitably leads to over-provisioned, under-utilized platform resources, essentially forfeiting the power inherent in a cloud platform to dynamically optimize app performance.

Aspen Mesh Optimizes Predictive Scaling with HPA

At Aspen Mesh, we believe that Horizontal Pod Autoscaler is a powerful and necessary feature in a cloud-native platform. Unfortunately, the challenges of HPA often outweigh the value. HPA users frequently choose to overprovision applications rather than rely on HPA to help meet SLOs. At Aspen Mesh we are making it possible to take full advantage of HPA's capabilities. We are using the HPA API to give you what you need to turn on HPA and ensure it's optimized, without locking you into a proprietary solution.  

To summarize, the shortcomings of HPA are centered around three areas:  

  1. Threshold-based, reactive scale events 
  2. The lack of knowledge of when to scale
  3. Primitive metrics used to target autoscaling 

To address these challenges and make HPA the valuable component of a cloud-native platform that it promises to be, Aspen Mesh is introducing Predictive Scaling for Kubernetes and Istio. Predictive Scaling leverages the HPA API and uses machine learning and the rich telemetry data available from your applications to model your applications' behavior and give insight into when applications should be scaled out or in. 

Aspen Mesh’s Predictive Scaling addresses each of the critical shortcomings of HPA: 

  1. Uses ML models to learn and predict your applications’ behaviors over time, obviating thresholds and reactive scaling. 
  2. Gives you insight into your applications’ behaviors over time so that you don’t have to take a guess at a starting point for thresholds and dial them in over time. 
  3. Leverages all telemetry data available for your application, including HTTP metrics, application metrics, Pod and container metrics, and platform metrics, to create a 360º view of your application. No more relying on one or two unrelated metrics. Over time we will add trace and log data to our models. 

Request Early Access now to get full access to the Aspen App Intelligence Platform and be the first to try Aspen Mesh’s Predictive Scaling and other solutions as we release them. Getting started takes just a few minutes, and you’ll get new insight into your app’s behavior on Day One. And don’t hesitate to reach out to have a conversation with us anytime.