I have a 3-node cluster running on GKE. All the nodes are preemptible, meaning they can be killed at any time and generally do not live longer than 24 hours. When a node is killed, the autoscaler spins up a replacement node, which usually takes a minute or so.
In my cluster I have a deployment with replicas set to 3. My intention is that the pods will be spread across the nodes so that my application keeps running as long as at least one node in the cluster is alive.
I've used the following affinity configuration so that pods prefer to run on hosts that aren't already running a pod from this deployment:
spec:
  affinity:
    podAntiAffinity:
      preferredDuringSchedulingIgnoredDuringExecution:
      - podAffinityTerm:
          labelSelector:
            matchExpressions:
            - key: app
              operator: In
              values:
              - my-app
          topologyKey: kubernetes.io/hostname
        weight: 100
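For context, here is a minimal sketch of the Deployment this block sits in. The name, image and container details are illustrative; the relevant parts are the app: my-app label, replicas: 3, and the affinity block under spec.template.spec:

apiVersion: apps/v1
kind: Deployment
metadata:
  name: my-app
spec:
  replicas: 3
  selector:
    matchLabels:
      app: my-app
  template:
    metadata:
      labels:
        app: my-app               # label matched by the anti-affinity selector
    spec:
      affinity:
        podAntiAffinity:          # same preference as shown above
          preferredDuringSchedulingIgnoredDuringExecution:
          - podAffinityTerm:
              labelSelector:
                matchExpressions:
                - key: app
                  operator: In
                  values:
                  - my-app
              topologyKey: kubernetes.io/hostname
            weight: 100
      containers:
      - name: my-app
        image: my-app:latest      # illustrative image name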
When I scale my application up from 0 this seems to work as intended. But in practice the following happens:
- Let's say pods A, B and C belonging to the my-app replicaset are running on nodes 1, 2 and 3 respectively. So the state would be:
1 -> A
2 -> B
3 -> C
- Node 3 is killed taking pod C with it, resulting in 2 running pods in the replicaset.
- The scheduler automatically starts to schedule a new pod to bring the replicaset back up to 3.
- It looks for a node without any pods for my-app. As the autoscaler is still in the process of starting the replacement node (4), only 1 and 2 are available.
- It schedules the new pod D on node 1.
- Node 4 eventually comes online, but as my-app already has all of its pods scheduled, it doesn't have any pods running on it. The resultant state is:
1 -> A, D
2 -> B
4 -> -
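For reference, this placement can be checked with:

kubectl get pods -l app=my-app -o wide

which lists the node each pod is running on (assuming the pods carry the app=my-app label used above).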
This is not the ideal configuration. The problem arises because there's a delay creating the new node and the scheduler is not aware that it will be available very soon.
Is there a better configuration that can ensure the pods are always distributed across the nodes? I was thinking a directive like preferredDuringSchedulingPreferredDuringExecution might do it, but that doesn't exist.