Overview
A pod has been deployed and remains in a Pending state for longer than expected.
Check RunBook Match
When running a kubectl get pods command, you will see a line like this in the output for your pod:
NAME                     READY   STATUS    RESTARTS   AGE
nginx-7ef9efa7cd-qasd2   0/1     Pending   0          1h
Initial Steps Overview
Detailed Steps
1) Gather information
To determine the root cause here, first gather relevant information that you may need to refer back to later:
kubectl describe pod [POD_NAME] -n [NAMESPACE] > /tmp/runbooks_describe_pod.txt
kubectl describe nodes > /tmp/runbooks_describe_nodes.txt
kubectl get componentstatuses > /tmp/runbooks_componentstatuses.txt
2) Examine pod Events output
Look at the Events section of your /tmp/runbooks_describe_pod.txt file.
2.1) If the last message is pulling image, then skip to Debug pulling image (step 7).
2.2) If you see a FailedScheduling warning with Insufficient cpu or Insufficient memory mentioned, you have run out of resources available to run your pod:
Warning FailedScheduling 40s (x98 over 2h) default-scheduler 0/1 nodes are available: 1 Insufficient cpu (1).
Warning FailedScheduling 40s (x98 over 2h) default-scheduler 0/1 nodes are available: 1 Insufficient memory (1).
Go to Solution B (Allocate resources).
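To see what the pod is actually requesting, this hedged check may help (the jsonpath expression is illustrative; the bracketed names are placeholders):
# Show the resource requests/limits declared by each container in the pod
kubectl get pod [POD_NAME] -n [NAMESPACE] -o jsonpath='{.spec.containers[*].resources}'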
2.3) If you see a FailedScheduling [...] 0/n nodes are available warning, you have run out of nodes available to assign this pod to:
Warning FailedScheduling 3m (x57 over 19m) default-scheduler 0/1 nodes are available: 1 MatchNodeSelector.
Skip to Debug no nodes available (step 6).
2.4) If you see a cni config error like this:
Container runtime network not ready: NetworkReady=false reason:NetworkPluginNotReady message:docker: network plugin is not ready: cni config uninitialized
then see Solution C (Repair your CNI).
2.5) If you see pod has unbound immediate PersistentVolumeClaims warnings like this:
Warning FailedScheduling 7s (x15 over 17m) default-scheduler running "VolumeBinding" filter plugin for pod "podname-rh-0": pod has unbound immediate PersistentVolumeClaims
then consider:
- have you bound the same PersistentVolume to multiple pods (eg in a StatefulSet) when the volume cannot be concurrently bound to multiple pods? The accessModes need to be ReadWriteMany if you want multiple pods to access the same volume.
See also here for more background on this.
- does the specified PersistentVolumeClaim exist?
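To check both points, these commands may help (the namespace is a placeholder; match the claim names against those referenced in the pod spec):
# List the claims in the namespace and check that the one the pod references exists and is Bound
kubectl get pvc -n [NAMESPACE]
# List the volumes and check their ACCESS MODES and CLAIM columns
kubectl get pv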
3) Check the kubelet logs
If your pod has been assigned to a node, and you have admin access to that node, then it may be worth checking the kubelet logs for errors on that node.
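If the node runs the kubelet under systemd (an assumption; the unit name and command below will vary with your node's service manager), a minimal sketch of pulling those logs is:
# Show the last 200 kubelet log lines on the node (unit name kubelet is an assumption)
journalctl -u kubelet --no-pager -n 200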
Otherwise, you can run:
kubectl get nodes -o wide [NODE_NAME]
to check on the status of that node. If it does not appear ready, then run:
kubectl describe nodes [NODE_NAME]
4) Is this a coredns or kube-dns pod?
If so, this may be intentional behaviour. See here.
5) Check the kubelet is running
If the kubelet is not running on the node the pod has been assigned to, the pod may remain in a Pending state.
You can check this in various ways that may be context-dependent, eg:
systemctl status [SERVICE_NAME]
To determine the SERVICE_NAME above, you may want to run systemctl --type service | grep kube.
You can also check whether the kubelet process is running with:
ps -ef | grep kubelet
If the kubelet is not running, go to Solution A (Restart kubelet).
6) Debug no nodes available
This might be caused by:
- pod demanding a particular node label
See here for more on pod restrictions, and examine /tmp/runbooks_describe_pod.txt to see whether the pod has any nodeSelectors set, and if so, whether any available nodes have labels that match them (see the sketch after this list).
- pod anti-affinity
See here for more on pod affinity and anti-affinity.
You may see more useful debug information in the original warning message in /tmp/runbooks_describe_pod.txt:
Warning FailedScheduling <unknown> default-scheduler 0/1 nodes are available: 1 node(s) had taint {node.kubernetes.io/disk-pressure: }, that the pod didn't tolerate.
- nodes being busy
If your pod could not be scheduled because the nodes were busy, then step 2.2 above should have caught this.
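A minimal sketch of comparing the pod's placement constraints against what the nodes offer (the bracketed names are placeholders):
# nodeSelector the pod is demanding, if any
kubectl get pod [POD_NAME] -n [NAMESPACE] -o jsonpath='{.spec.nodeSelector}'
# labels and taints the nodes actually carry
kubectl get nodes --show-labels
kubectl get nodes -o custom-columns=NAME:.metadata.name,TAINTS:.spec.taints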
7) Debug pulling image event
The first thing to consider is whether the download of the image needs more time.
If you think you have waited a sufficient amount of time, then it may be worth re-running the describe pod command from step 1 to see if any message has followed it.
If there is still no output, and you have admin access, you may want to log onto the node the pod has been assigned to and run the appropriate pull command (eg docker pull [IMAGE_NAME]) on the image to see if the image is downloadable from the node.
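If the node uses containerd rather than Docker (you can check the CONTAINER-RUNTIME column of kubectl get nodes -o wide), an equivalent check with the CRI client, assuming crictl is installed on the node, is:
# Pull the image via the container runtime interface instead of the Docker CLI
crictl pull [IMAGE_NAME]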
8) Check component statuses
Examine the output of your /tmp/runbooks_componentstatuses.txt file, looking for unhealthy components.
This is most commonly a problem when a cluster has just been stood up.
Solutions List
A) Restart kubelet
B) Allocate resources
C) Repair your CNI
Solutions Detail
A) Restart kubelet
How exactly to restart the kubelet will depend on its process supervisor. The most common one is systemd, via systemctl:
systemctl restart [SERVICE_NAME]
If you don’t know how to restart the kubelet, you may need to contact your system administrator.
B) Allocate resources
Determine whether you need to increase the resources available, or limit resources your pod requests so as not to breach the limits. Which is appropriate depends on your particular circumstances. See the “0 nodes available” runbook for further guidance.
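As a hedged starting point, these commands show how much of each node is already committed and what is actually in use; the deployment name and resource values are placeholders, and kubectl top requires metrics-server to be installed:
# How much CPU/memory is already requested on each node
kubectl describe nodes | grep -A 8 'Allocated resources'
# Actual current usage per node (needs metrics-server)
kubectl top nodes
# Lower the requests of the pod's deployment so it fits (values are illustrative)
kubectl set resources deployment [DEPLOYMENT_NAME] -n [NAMESPACE] --requests=cpu=100m,memory=128Mi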
C) Repair your CNI
These links may help you resolve this problem:
More rarely, this has been suggested as a solution:
Check if docker and kubernetes are using the same cgroup driver. I faced the same issue (CentOS 7, kubernetes v1.14.1), and setting same cgroup driver (systemd) fixed it.
Source
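A minimal sketch of that cgroup driver comparison (the kubelet config path assumes a kubeadm-provisioned node):
# Docker's cgroup driver
docker info | grep -i 'cgroup driver'
# The kubelet's cgroup driver on a kubeadm node
grep cgroupDriver /var/lib/kubelet/config.yaml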
Check Resolution
If the pod starts up with status Running according to the output of kubectl get pods, then the issue has been resolved.
If there is a different status, then it may be that this particular issue is resolved, but a new issue has been revealed.
If it has not been resolved by this runbook, then please comment below.
Further Steps
None
Further Information
See here for background information on pod placement.