Managing Kubernetes Clusters at Scale: Cluster Overprovisioning
Is Cluster Autoscaler alone enough?
Cluster Autoscaler (CA) is the part of the Kubernetes ecosystem that automatically adds or removes nodes in a cluster based on resource requests from pods. You may have heard of Horizontal Pod Autoscaler (HPA) and Vertical Pod Autoscaler (VPA) too, but let’s save those topics for another day :P
In a nutshell, Cluster Autoscaler is designed to perform two main tasks:
- Add more nodes to the K8s node group if there are pending pods. The default scan interval is 10 seconds, which you can modify with the --scan-interval flag.
- Remove “underutilized” nodes from the cluster. What is the condition to scale down? The sum of the CPU and memory requests of all pods running on a node falls below a threshold. The threshold defaults to 50% of the node’s allocatable resources and can be configured with the --scale-down-utilization-threshold flag. Both flags are shown in the sketch right after this list.
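For reference, both flags are passed straight to the cluster-autoscaler binary. Below is a minimal sketch of the relevant container spec, assuming CA runs as a Deployment on AWS; the image tag and cloud provider are assumptions, and RBAC plus node-group discovery flags are omitted for brevity.
      containers:
      - name: cluster-autoscaler
        image: registry.k8s.io/autoscaling/cluster-autoscaler:v1.28.0  # tag is an assumption
        command:
        - ./cluster-autoscaler
        - --cloud-provider=aws                    # assumption: AWS, as used later in this post
        - --scan-interval=10s                     # how often CA re-evaluates the cluster (default 10s)
        - --scale-down-utilization-threshold=0.5  # nodes below 50% requested resources become scale-down candidates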
That sounds great, right? We can make our K8s cluster elastic while our SRE team enjoys a good night’s sleep without PagerDuty calls. However, Cluster Autoscaler alone does not guarantee high availability of applications.
What’s the problem with Cluster Autoscaler?
Actually, it is not a problem with Cluster Autoscaler itself; it is about how fast we can get a new node to join the cluster. It could take as long as 7 minutes to get that node up and running. Let me repeat, it’s 7 freaking minutes. During those 7 minutes, our new pods will be sitting there in the Pending state, and as a software engineer, you want that Pending time to be as short as possible.
Actually, with some tweaks and a combination of Pod Priority and Preemption and cluster-proportional-autoscaler, we can shorten this time to a matter of seconds.
Pod Priority and Preemption
Just by looking at the names, we can guess that these features help us assign priorities to pods. We can group deployed applications by order of importance and assign priorities accordingly (for example: overprovision, low, high, monitoring, system-critical). When node resources are scarce, lower-priority pods are evicted to make room for higher-priority pods to be scheduled.
Interestingly, this means we can create a lowest-priority class just for overprovisioning. Pods with the overprovisioning priority are called pause pods; their sole purpose is to reserve extra resources in our K8s cluster, and they are evicted to free up space as soon as “real” pods need to run.
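Under the hood, this is nothing more than a PriorityClass object. Here is a minimal sketch, assuming a class named overprovisioning with a negative value so that it always sits below the “real” workloads (both the name and the value are illustrative):
apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
  name: overprovisioning          # illustrative name, referenced by the pause pods
value: -10                        # anything lower than your real workloads will do
globalDefault: false
description: "Priority class for pause pods that only reserve spare capacity."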
This is how the flow is going to be:
- We have several pause pods running as a buffer. All nodes are running at near full capacity.
- A new deployment is created with a higher priority than the pause pods.
- The pause pods are evicted. Their state changes from Running to Pending while the new deployment’s pods take their place on the current set of nodes.
- On its next scan, Cluster Autoscaler finds the pause pods sitting in the Pending state, and hence new nodes are created to host those pause pods.
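To make the flow concrete, here is a sketch of what such a pause pod deployment might look like. The image tag, resource requests, and replica count are placeholders; the replica count is exactly what we will hand over to cluster-proportional-autoscaler below.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: overprovisioning
  namespace: default
spec:
  replicas: 2                              # placeholder; CPA will manage this number later
  selector:
    matchLabels:
      run: overprovisioning
  template:
    metadata:
      labels:
        run: overprovisioning
    spec:
      priorityClassName: overprovisioning  # the low-priority class sketched above
      containers:
      - name: reserve-resources
        image: registry.k8s.io/pause:3.9   # does nothing except hold the reservation
        resources:
          requests:
            cpu: "1"                       # size of each buffer "slot" (assumption)
            memory: 1Gi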
The reason for running actual pods for overprovisioning goes beyond reserving CPU and memory. The pods also hold on to IP addresses, which get exhausted quite frequently if we run a big enough cluster. I’m sure many people have encountered this issue at least once when managing a K8s cluster:
Warning FailedCreatePodSandBox 10m kubelet, ip-10-60-200-192.ec2.internal Failed create pod sandbox: rpc error: code = Unknown desc = failed to set up sandbox container "abcxyz" network for pod "xxxxx-xxxxxserver-deployment-123456abc-xxxx": networkPlugin cni failed to set up pod "xxxxx-xxxxxserver-deployment-123456abc-xxxx_default" network: add cmd: failed to assign an IP address to container
There are AWS articles that guide users through resolving this issue, and I’m sure similar articles exist for other cloud providers too, but let’s just use AWS as a representative for now.
How much buffer to set?
Now that we know we can skillfully use Pod Priority to create pause pods acting as our buffer, the next question is deciding on an appropriate amount of buffer. Should a 100-node cluster get the same amount of buffer as a 10-node cluster? Surely we do not want to hardcode an absolute number of buffer nodes, but rather express the buffer as a proportion of the cluster.
Luckily, there is cluster-proportional-autoscaler (CPA), which helps us achieve this. CPA, to put it simply, is just a deployment that dynamically increases or decreases the number of replicas of a target workload based on the total number of cores and the total number of nodes in the cluster, which are controlled by the coresPerReplica and nodesPerReplica settings respectively.
The reason we need both values is to account for two types of cluster: core-heavy and node-heavy. A core-heavy cluster has only a few nodes, but each node is a very powerful VM instance. A node-heavy cluster is the opposite: many nodes of a smaller instance type. An example setup, roughly following the CPA documentation and using a simple nginx deployment as the scale target, looks like this:
kind: ConfigMap
apiVersion: v1
metadata:
  name: nginx-autoscaler
  namespace: default
data:
  linear: |-
    {
      "coresPerReplica": 2,
      "nodesPerReplica": 1,
      "preventSinglePointFailure": true
    }
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: nginx-autoscale-example
  namespace: default
spec:
  selector:
    matchLabels:
      run: nginx-autoscale-example
  replicas: 1
  template:
    metadata:
      labels:
        run: nginx-autoscale-example
    spec:
      containers:
      - name: nginx-autoscale-example
        image: nginx
        ports:
        - containerPort: 80
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: nginx-autoscaler
  namespace: default
  labels:
    app: autoscaler
spec:
  selector:
    matchLabels:
      app: autoscaler
  replicas: 1
  template:
    metadata:
      labels:
        app: autoscaler
    spec:
      containers:
      - image: gcr.io/google_containers/cluster-proportional-autoscaler-amd64:{LATEST_RELEASE}
        name: autoscaler
        command:
        - /cluster-proportional-autoscaler
        - --namespace=default
        - --configmap=nginx-autoscaler
        - --target=deployment/nginx-autoscale-example
        - --logtostderr=true
        - --v=2
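If I read the linear mode documentation correctly, CPA computes replicas = max(ceil(cores / coresPerReplica), ceil(nodes / nodesPerReplica)), and preventSinglePointFailure bumps the result to at least 2 when the cluster has more than one node. With the values above (coresPerReplica: 2, nodesPerReplica: 1), a 10-node cluster of 8-core instances would get max(ceil(80 / 2), ceil(10 / 1)) = 40 replicas of the target. For the overprovisioning setup described earlier, you would simply point --target at the pause pod deployment instead of the nginx example, so the buffer grows and shrinks in proportion to the cluster. Keep in mind that the autoscaler pod also needs RBAC permissions to list nodes, read its ConfigMap, and scale the target; the upstream examples include the matching ServiceAccount and role bindings.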
Summary
Cluster Autoscaler + Pod Priority + cluster-proportional-autoscaler = Profit