If Taint Based Evictions are in effect, the controller manager cannot evict a pod that tolerates the node's taint. Even if you don't define any tolerations in your pod spec, the pod gets default ones, because the DefaultTolerationSeconds admission controller plugin is enabled by default.
The DefaultTolerationSeconds admission controller plugin configures your pod like below:
tolerations:
- key: node.kubernetes.io/not-ready
  operator: Exists
  effect: NoExecute
  tolerationSeconds: 300
- key: node.kubernetes.io/unreachable
  operator: Exists
  effect: NoExecute
  tolerationSeconds: 300
You can verify this by inspecting the definition of your pod:
kubectl get pods -o yaml -n <namespace> <pod-name>
According to the above tolerations, it takes more than 5 minutes to recreate the pod on another ready node, since the pod can tolerate the not-ready taint for up to 5 minutes. In this case, even if you set --pod-eviction-timeout to 20s, there is nothing the controller manager can do because of the tolerations.
But why does it take more than 5 minutes? Because the node is only considered down after --node-monitor-grace-period, which defaults to 40s. Only after that does the pod's toleration timer start.
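Putting the numbers together, a rough timeline for the default settings (assuming the kubelet stops posting node status at t=0) looks like this:
t=0s     kubelet stops updating the node status
t=40s    --node-monitor-grace-period expires; the node is marked NotReady and the
         node.kubernetes.io/not-ready:NoExecute taint is applied; the toleration timer starts
t=340s   tolerationSeconds (300) expires; the pod is evicted and recreated on a ready node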
Recommended Solution
If you want your cluster to react faster to node outages, you should use taints and tolerations without modifying any options. For example, you can define your pod like below:
tolerations:
- key: node.kubernetes.io/not-ready
  operator: Exists
  effect: NoExecute
  tolerationSeconds: 0
- key: node.kubernetes.io/unreachable
  operator: Exists
  effect: NoExecute
  tolerationSeconds: 0
With the above tolerations your pod will be recreated on a ready node as soon as the current node is marked as not ready. This should take less than a minute, since --node-monitor-grace-period defaults to 40s.
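For context, here is where those tolerations sit in a complete Pod manifest (a minimal sketch; the pod name and image are placeholders, not from the original setup):
apiVersion: v1
kind: Pod
metadata:
  name: fast-failover-pod   # hypothetical name
spec:
  containers:
  - name: app
    image: nginx            # placeholder image
  tolerations:
  - key: node.kubernetes.io/not-ready
    operator: Exists
    effect: NoExecute
    tolerationSeconds: 0
  - key: node.kubernetes.io/unreachable
    operator: Exists
    effect: NoExecute
    tolerationSeconds: 0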
Available Options
If you want to be in control of these timings, you will find plenty of options below to do so. However, modifying these options should be avoided: tight timings create overhead on etcd, because every node will try to update its status very often.
In addition, it is currently not clear how to propagate changes to the controller manager, API server and kubelet configuration across all nodes in a live cluster. Please see Tracking issue for changing the cluster and Dynamic Kubelet Configuration. As of this writing, reconfiguring a node's kubelet in a live cluster is in beta.
You can configure the control plane and the kubelet during the kubeadm init or join phase. Please refer to Customizing control plane configuration with kubeadm and Configuring each kubelet in your cluster using kubeadm for more details.
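For example, at init time these options can be set through a kubeadm configuration file rather than by editing manifests afterwards. The sketch below just restates the defaults; the API versions shown are the ones current as of this writing and may differ on your kubeadm release:
# kubeadm-config.yaml (a sketch; values shown are the defaults)
apiVersion: kubeadm.k8s.io/v1beta2
kind: ClusterConfiguration
controllerManager:
  extraArgs:
    node-monitor-grace-period: "40s"
    node-monitor-period: "5s"
    pod-eviction-timeout: "5m0s"
apiServer:
  extraArgs:
    default-not-ready-toleration-seconds: "300"
    default-unreachable-toleration-seconds: "300"
---
apiVersion: kubelet.config.k8s.io/v1beta1
kind: KubeletConfiguration
nodeStatusUpdateFrequency: "10s"
You would then pass it with kubeadm init --config kubeadm-config.yaml.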
Assuming you have a single-node cluster:
- controller manager includes:
  --node-monitor-grace-period (default 40s)
  --node-monitor-period (default 5s)
  --pod-eviction-timeout (default 5m0s)
- api server includes:
  --default-not-ready-toleration-seconds (default 300)
  --default-unreachable-toleration-seconds (default 300)
- kubelet includes:
  --node-status-update-frequency (default 10s)
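To see which of these are already set on your control plane, you can grep the static pod manifests (a quick check, assuming the default kubeadm manifest paths mentioned below):
grep -E 'node-monitor|pod-eviction' /etc/kubernetes/manifests/kube-controller-manager.yaml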
If you set up the cluster with kubeadm you can modify:
- /etc/kubernetes/manifests/kube-controller-manager.yaml for controller manager options.
- /etc/kubernetes/manifests/kube-apiserver.yaml for api server options.
Note: Modifying these files will reconfigure and restart the respective static pod on the node.
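For example, lowering the default toleration seconds would mean adding the flags to the command list in the API server manifest (an abbreviated excerpt; the 30-second values are illustrative, not a recommendation):
# /etc/kubernetes/manifests/kube-apiserver.yaml (excerpt)
spec:
  containers:
  - command:
    - kube-apiserver
    - --default-not-ready-toleration-seconds=30
    - --default-unreachable-toleration-seconds=30
    # ...remaining flags and fields unchanged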
In order to modify the kubelet config you can add the line below:
KUBELET_EXTRA_ARGS="--node-status-update-frequency=10s"
to /etc/default/kubelet (for DEBs) or /etc/sysconfig/kubelet (for RPMs), and then restart the kubelet service:
sudo systemctl daemon-reload && sudo systemctl restart kubelet
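After the restart you can sanity-check that the kubelet picked up the flag by looking at its command line (a quick check, not part of the original steps):
ps aux | grep kubelet
# look for --node-status-update-frequency=<your value> among the kubelet flags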