If Taint Based Evictions are in effect, the controller manager cannot evict a pod that tolerates the node's taint. Even if you don't define any tolerations in your pod spec, the pod gets default ones, because the DefaultTolerationSeconds admission controller plugin is enabled by default.
The DefaultTolerationSeconds admission controller plugin configures your pod like below:
tolerations:
- key: node.kubernetes.io/not-ready
  operator: Exists
  effect: NoExecute
  tolerationSeconds: 300
- key: node.kubernetes.io/unreachable
  operator: Exists
  effect: NoExecute
  tolerationSeconds: 300
You can verify this by inspecting the definition of your pod:
kubectl get pods -o yaml -n <namespace> <pod-name>
According to the above tolerations, it takes more than 5 minutes to recreate the pod on another ready node, since the pod can tolerate the not-ready taint for up to 5 minutes. In this case, even if you set --pod-eviction-timeout to 20s, there is nothing the controller manager can do because of the tolerations.
But why does it take more than 5 minutes? Because the node is only considered down after --node-monitor-grace-period, which defaults to 40s. Only after that does the pod's toleration timer start.
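Putting the numbers together, a rough timeline for the default settings (assuming the kubelet stops posting node status at t=0) looks like this:
t=0s     kubelet stops updating the node status
t=40s    --node-monitor-grace-period expires; the node is marked NotReady and the
         node.kubernetes.io/not-ready:NoExecute taint is applied; the toleration timer starts
t=340s   tolerationSeconds (300) expires; the pod is evicted and recreated on a ready node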
Recommended Solution
If you want your cluster to react faster to node outages, you should use taints and tolerations without modifying any options. For example, you can define your pod like below:
tolerations:
- key: node.kubernetes.io/not-ready
  operator: Exists
  effect: NoExecute
  tolerationSeconds: 0
- key: node.kubernetes.io/unreachable
  operator: Exists
  effect: NoExecute
  tolerationSeconds: 0
With the above tolerations your pod will be recreated on a ready node as soon as the current node is marked as not ready. This should take less than a minute, since --node-monitor-grace-period defaults to 40s.
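For context, here is where those tolerations sit in a complete Pod manifest (a minimal sketch; the pod name and image are placeholders, not from the original setup):
apiVersion: v1
kind: Pod
metadata:
  name: fast-failover-pod   # hypothetical name
spec:
  containers:
  - name: app
    image: nginx            # placeholder image
  tolerations:
  - key: node.kubernetes.io/not-ready
    operator: Exists
    effect: NoExecute
    tolerationSeconds: 0
  - key: node.kubernetes.io/unreachable
    operator: Exists
    effect: NoExecute
    tolerationSeconds: 0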
Available Options
If you want to be in control of these timings, you will find plenty of options below to do so. However, modifying these options should be avoided: tight timings create overhead on etcd, because every node will try to update its status very often.
In addition, it is currently not clear how to propagate changes to the controller manager, API server and kubelet configuration across all nodes in a live cluster. Please see Tracking issue for changing the cluster and Dynamic Kubelet Configuration. As of this writing, reconfiguring a node's kubelet in a live cluster is in beta.
You can configure the control plane and the kubelet during the kubeadm init or join phase. Please refer to Customizing control plane configuration with kubeadm and Configuring each kubelet in your cluster using kubeadm for more details.
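For example, at init time these options can be set through a kubeadm configuration file rather than by editing manifests afterwards. The sketch below just restates the defaults; the API versions shown are the ones current as of this writing and may differ on your kubeadm release:
# kubeadm-config.yaml (a sketch; values shown are the defaults)
apiVersion: kubeadm.k8s.io/v1beta2
kind: ClusterConfiguration
controllerManager:
  extraArgs:
    node-monitor-grace-period: "40s"
    node-monitor-period: "5s"
    pod-eviction-timeout: "5m0s"
apiServer:
  extraArgs:
    default-not-ready-toleration-seconds: "300"
    default-unreachable-toleration-seconds: "300"
---
apiVersion: kubelet.config.k8s.io/v1beta1
kind: KubeletConfiguration
nodeStatusUpdateFrequency: "10s"
You would then pass it with kubeadm init --config kubeadm-config.yaml.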
Assuming you have a single-node cluster:
- controller manager includes:
  --node-monitor-grace-period (default 40s)
  --node-monitor-period (default 5s)
  --pod-eviction-timeout (default 5m0s)
- api server includes:
  --default-not-ready-toleration-seconds (default 300)
  --default-unreachable-toleration-seconds (default 300)
- kubelet includes:
  --node-status-update-frequency (default 10s)
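To see which of these are already set on your control plane, you can grep the static pod manifests (a quick check, assuming the default kubeadm manifest paths mentioned below):
grep -E 'node-monitor|pod-eviction' /etc/kubernetes/manifests/kube-controller-manager.yaml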
If you set up the cluster with kubeadm you can modify:
- /etc/kubernetes/manifests/kube-controller-manager.yaml for controller manager options.
- /etc/kubernetes/manifests/kube-apiserver.yaml for api server options.
Note: Modifying these files will reconfigure and restart the respective static pod on the node.
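For example, lowering the default toleration seconds would mean adding the flags to the command list in the API server manifest (an abbreviated excerpt; the 30-second values are illustrative, not a recommendation):
# /etc/kubernetes/manifests/kube-apiserver.yaml (excerpt)
spec:
  containers:
  - command:
    - kube-apiserver
    - --default-not-ready-toleration-seconds=30
    - --default-unreachable-toleration-seconds=30
    # ...remaining flags and fields unchanged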
In order to modify the kubelet config you can add the line below:
KUBELET_EXTRA_ARGS="--node-status-update-frequency=10s"
to /etc/default/kubelet (for DEBs) or /etc/sysconfig/kubelet (for RPMs), and then restart the kubelet service:
sudo systemctl daemon-reload && sudo systemctl restart kubelet
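After the restart you can sanity-check that the kubelet picked up the flag by looking at its command line (a quick check, not part of the original steps):
ps aux | grep kubelet
# look for --node-status-update-frequency=<your value> among the kubelet flags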