
A long-running job (45 h) is moved to another pod, causing it to restart.

From the logs I can see that the job received a SIGTERM and was then restarted in another pod, probably on another node as well.

The information available in Google Cloud is not much help: neither the YAML nor the Events pages describe this event, apart from the creation of the new pod.

The job YAML shows creationTimestamp: 2019-06-15T10:39:25Z

The pod YAML shows creationTimestamp: 2019-06-17T13:26:25Z

I use a mostly default configuration on 1.12.6-gke.11 with several nodes, and the nodes are not preemptible.

Is this default Kubernetes behavior? If so, how can I disable it?

1 Answer


Since you've said that you're using cluster autoscaling, I'm going to assume that the pod is getting removed because the cluster is getting scaled in. We saw a similar issue because we're running video transcoding jobs using a 0-scaled node pool (which then scales out as jobs are added).

Looking into it, we found the cluster autoscaler documentation and modified our jobs accordingly:

What types of pods can prevent CA from removing a node?

  • Pods with restrictive PodDisruptionBudget.

  • Kube-system pods that:

    • are not run on the node by default, *
    • don't have a pod disruption budget set or their PDB is too restrictive (since CA 0.6).

  • Pods that are not backed by a controller object (so not created by deployment, replica set, job, stateful set etc). *

  • Pods with local storage. *

  • Pods that cannot be moved elsewhere due to various constraints (lack of resources, non-matching node selectors or affinity, matching anti-affinity, etc)

  • Pods that have the following annotation set: "cluster-autoscaler.kubernetes.io/safe-to-evict": "false"
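
To give a concrete picture, here is a minimal sketch of that annotation applied to a Job's pod template (the name, image and resource values below are placeholders, not taken from either of our setups):

```yaml
apiVersion: batch/v1
kind: Job
metadata:
  name: long-running-job            # placeholder name
spec:
  backoffLimit: 0
  template:
    metadata:
      annotations:
        # Tells the cluster autoscaler not to evict this pod when scaling in.
        cluster-autoscaler.kubernetes.io/safe-to-evict: "false"
    spec:
      restartPolicy: Never
      containers:
        - name: worker                      # placeholder container name
          image: gcr.io/my-project/worker   # placeholder image
          resources:
            requests:
              cpu: "2"      # placeholder request
              memory: 4Gi
```

Note that the annotation sits under spec.template.metadata, not on the Job object itself, because the autoscaler looks at the annotations of the pods it would have to evict.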

It was the safe-to-evict annotation that did the trick for us. I recommend using it as a starting point.
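
If you would rather not block eviction outright, the first bullet (a restrictive PodDisruptionBudget) is another lever: a PDB with maxUnavailable: 0 selecting the job's pods also stops the autoscaler from draining the node while they run. A rough sketch, assuming the job's pods carry a hypothetical app: long-running-job label:

```yaml
apiVersion: policy/v1beta1        # policy/v1 on Kubernetes 1.21+
kind: PodDisruptionBudget
metadata:
  name: long-running-job-pdb      # placeholder name
spec:
  maxUnavailable: 0               # forbid voluntary evictions of matching pods
  selector:
    matchLabels:
      app: long-running-job       # must match the labels on the Job's pod template
```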