Pod Stuck in Pending State
Lastmod: 2020-07-31

Overview

A pod has been deployed but remains in the Pending state for longer than expected.

Check RunBook Match

When running a kubectl get pods command, you will see a line like this in the output for your pod:

NAME                     READY     STATUS             RESTARTS   AGE
nginx-7ef9efa7cd-qasd2   0/1       Pending            0          1h
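
To pick Pending pods out of a longer listing, a small filter over the STATUS column can help. This is a minimal sketch: pending_pods is a hypothetical helper name, and it assumes the default five-column layout shown above (with --all-namespaces the columns shift by one).

```shell
# Hypothetical helper: print the names of pods whose STATUS column is Pending.
# Reads default `kubectl get pods` output (NAME READY STATUS RESTARTS AGE) on stdin.
pending_pods() {
  awk 'NR > 1 && $3 == "Pending" { print $1 }'
}

# Usage, assuming kubectl is configured for your cluster:
#   kubectl get pods -n [NAMESPACE] | pending_pods
```

kubectl can also filter on the server side with --field-selector=status.phase=Pending, which avoids parsing columns at all.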

Initial Steps Overview

  1. Gather information

  2. Examine pod Events output

  3. Check kubelet logs

  4. Is this a coredns or kube-dns pod?

  5. Check kubelet is running

  6. Debug no nodes available

  7. Debug pulling image

  8. Check component statuses

Detailed Steps

1) Gather information

To determine the root cause here, first gather relevant information that you may need to refer back to later:

kubectl describe pod -n [NAMESPACE] [POD_NAME] > /tmp/runbooks_describe_pod.txt
kubectl describe nodes > /tmp/runbooks_describe_nodes.txt
kubectl get componentstatuses > /tmp/runbooks_componentstatuses.txt

2) Examine pod Events output.

Look at the Events section of your /tmp/runbooks_describe_pod.txt file.
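
The describe output can be long, so it may help to print only the Events section. The sketch below assumes Events is the last section of the saved file, which is where kubectl describe normally places it; pod_events is a hypothetical helper name.

```shell
# Hypothetical helper: print only the Events section of saved
# `kubectl describe pod` output (from the "Events:" line to the end of the file).
pod_events() {
  awk '/^Events:/ { found = 1 } found' "${1:-/tmp/runbooks_describe_pod.txt}"
}

# Usage:
#   pod_events                      # reads /tmp/runbooks_describe_pod.txt
#   pod_events /path/to/other.txt   # or any other saved describe output
```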

2.1) If the last message is pulling image

then skip to step 7 (Debug pulling image).

2.2) If you see a FailedScheduling warning with Insufficient cpu or Insufficient memory

mentioned, you have run out of resources available to run your pod:

  Warning  FailedScheduling  40s (x98 over 2h)  default-scheduler  0/1 nodes are available: 1 Insufficient cpu (1).
  Warning  FailedScheduling  40s (x98 over 2h)  default-scheduler  0/1 nodes are available: 1 Insufficient memory (1).

Go to Solution B
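
To see how much of each node is already requested, the "Allocated resources" block in the node descriptions gathered in step 1 is usually the quickest reference. The following is a sketch only: node_allocations is a hypothetical helper, and it assumes the block is followed by the Events section, as in typical kubectl describe nodes output.

```shell
# Hypothetical helper: print each node's "Allocated resources" block from
# saved `kubectl describe nodes` output, preceded by the node's Name line.
node_allocations() {
  awk '
    /^Name:/                { print }      # start of a new node
    /^Allocated resources:/ { show = 1 }   # begin the block
    /^Events:/              { show = 0 }   # the next section ends it
    show
  ' "${1:-/tmp/runbooks_describe_nodes.txt}"
}

# Usage:
#   node_allocations
```

Comparing the Requests column against the requests in your pod spec indicates whether any node has room for the pod.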

2.3) If you see a FailedScheduling [...] 0/n nodes are available warning

with a reason other than insufficient resources, no node is currently eligible to run this pod:

  Warning  FailedScheduling  3m (x57 over 19m)  default-scheduler  0/1 nodes are available: 1 MatchNodeSelector.

Skip to step 6 (Debug no nodes available).

2.4) If you see a cni config error like this:

Container runtime network not ready: NetworkReady=false reason:NetworkPluginNotReady message:docker: network plugin is not ready: cni config uninitialized

then see Solution C

2.5) If you see pod has unbound immediate PersistentVolumeClaims warnings

like this:

Warning  FailedScheduling  7s (x15 over 17m)  default-scheduler  running "VolumeBinding" filter plugin for pod "podname-rh-0": pod has unbound immediate PersistentVolumeClaims

then consider:

  • have you bound the same PersistentVolume to multiple pods (eg in a stateful set) when the volume can’t be bound to multiple pods concurrently?

accessModes needs to be ReadWriteMany if you want multiple pods to access the same volume.

See also here for more background on this.

  • does the specified PersistentVolumeClaim exist?
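
To check both questions quickly, you can list the claims in the pod's namespace and look for any that are not Bound. A minimal sketch, assuming the default kubectl get pvc column layout (NAME, STATUS, ...); unbound_pvcs is a hypothetical helper name.

```shell
# Hypothetical helper: print the name and status of every
# PersistentVolumeClaim that is not Bound.
# Reads default `kubectl get pvc` output on stdin.
unbound_pvcs() {
  awk 'NR > 1 && $2 != "Bound" { print $1, $2 }'
}

# Usage, assuming kubectl is configured for your cluster:
#   kubectl get pvc -n [NAMESPACE] | unbound_pvcs
```

If the claim named in the pod spec does not appear in the kubectl get pvc output at all, it does not exist in that namespace.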

TODO: separate 2.5 out into its own runbook and reference from here. cf: 1
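
The branches in step 2 can be sketched as a simple pattern match over the saved describe output. This is an illustration rather than a complete classifier: classify_pending is a hypothetical helper, the patterns are taken from the example messages above, and the pulling image check looks for the message anywhere in the file rather than confirming it is the last event.

```shell
# Hypothetical helper: map the messages from step 2 onto a suggested next step.
# Reads saved `kubectl describe pod` output; prints one line.
classify_pending() {
  f=${1:-/tmp/runbooks_describe_pod.txt}
  if grep -qE 'Insufficient (cpu|memory)' "$f"; then
    echo "2.2: insufficient resources, see Solution B"
  elif grep -q 'cni config uninitialized' "$f"; then
    echo "2.4: CNI not ready, see Solution C"
  elif grep -q 'unbound immediate PersistentVolumeClaims' "$f"; then
    echo "2.5: unbound PersistentVolumeClaims"
  elif grep -qE 'FailedScheduling.*0/[0-9]+ nodes are available' "$f"; then
    echo "2.3: no nodes available, see step 6"
  elif grep -qi 'pulling image' "$f"; then
    echo "2.1: pulling image, see step 7"
  else
    echo "no known pattern, continue with step 3"
  fi
}

# Usage:
#   classify_pending
```

The resource check comes before the generic 0/n nodes check because both messages contain "nodes are available".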

3) Check the kubelet logs

If your pod has been assigned to a node, and you have admin access to that node, then it may be worth checking the kubelet logs for errors on that node.

Otherwise, you can run:

kubectl get nodes -o wide [NODE_NAME]

to check on the status of that node. If it does not appear ready, then run:

kubectl describe nodes [NODE_NAME]
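
For the kubelet log check at the start of this step, the logs on a systemd host are typically read with journalctl. The sketch below filters them down to error and fatal lines. It assumes the kubelet runs as a systemd unit named kubelet and logs in the klog format; kubelet_errors is a hypothetical helper name.

```shell
# Hypothetical helper: keep only error (E) and fatal (F) lines from kubelet logs.
# Assumes the klog line format, e.g.
#   E0731 12:00:01.000001 1234 kubelet.go:2183] Container runtime network not ready
kubelet_errors() {
  grep -E '(^|[[:space:]])[EF][0-9]{4} [0-9]{2}:[0-9]{2}:[0-9]{2}'
}

# Usage on the node (assumption: the systemd unit is called kubelet):
#   journalctl -u kubelet --since "1 hour ago" --no-pager | kubelet_errors
```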

4) Is this a coredns or kube-dns pod?

If so, this may be intentional behaviour. See here

5) Check the kubelet is running

If the kubelet is not running on the node the pod has been assigned to, this error may be seen.

You can check this in various ways that may be context-dependent, eg:

  • systemctl status [SERVICE_NAME]

To determine the SERVICE_NAME above, you may want to run systemctl --type service | grep kube.

  • ps -ef | grep kubelet

If the kubelet is not running, go to Solution A (Restart kubelet).
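
The service lookup above can be sketched as a small filter that prints only the unit names, which is handy when failed units are listed with a leading marker character. This assumes a systemd host; kube_services is a hypothetical helper name.

```shell
# Hypothetical helper: print systemd service unit names that contain "kube".
# Reads `systemctl --type service` output on stdin.
kube_services() {
  awk '{ for (i = 1; i <= NF; i++) if ($i ~ /kube.*\.service$/) { print $i; break } }'
}

# Usage on the node:
#   systemctl --type service | kube_services
#   systemctl is-active [SERVICE_NAME]
#   ps -ef | grep '[k]ubelet'   # the brackets stop grep matching its own process
```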

6) Debug no nodes available

This might be caused by:

  • pod demanding a particular node label

See here for more on pod restrictions and examine /tmp/runbooks_describe_pod.txt to see whether the pod has any nodeSelectors set, and if so, whether any available nodes have labels that match these selectors.

  • pod anti-affinity

See here for more on pod affinity and anti-affinity.

You may see more useful debug information in the original warning message in /tmp/runbooks_describe_pod.txt, for example a node taint that the pod does not tolerate:

Warning  FailedScheduling  <unknown>  default-scheduler  0/1 nodes are available: 1 node(s) had taint {node.kubernetes.io/disk-pressure: }, that the pod didn't tolerate.

  • nodes being busy

If your pod could not be scheduled because nodes were busy, then step 2.2 should have caught this.
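
To compare what the pod demands with what the nodes offer, you can pull the relevant lines out of the two files saved in step 1. This is a sketch under assumptions: both helper names are hypothetical, the pod helper relies on Node-Selectors and Tolerations appearing just before Events in the describe output, and the node helper prints only the first taint on each node.

```shell
# Hypothetical helper: print the pod's Node-Selectors and Tolerations lines
# (everything from "Node-Selectors:" up to "Events:").
pod_constraints() {
  awk '/^Node-Selectors:/ { show = 1 } /^Events:/ { show = 0 } show' \
    "${1:-/tmp/runbooks_describe_pod.txt}"
}

# Hypothetical helper: print "<node name> <first taint>" for each node.
node_taints() {
  awk '/^Name:/ { name = $2 } /^Taints:/ { print name, $2 }' \
    "${1:-/tmp/runbooks_describe_nodes.txt}"
}

# Usage:
#   pod_constraints
#   node_taints
#   kubectl get nodes --show-labels   # node labels, to compare with the selectors
```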

7) Debug pulling image event

The first thing to consider is whether the download of the image needs more time.

If you think you have waited a sufficient amount of time, then it may be worth re-running the describe pod command from step 1 to see if any message has followed it.

If there is still no output, and you have admin access, you may want to log onto the node the pod has been assigned to and run the appropriate pull command (eg docker pull [IMAGE_NAME]) on the image to see if the image is downloadable from the node.
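
To find the exact image names to pull, you can read them from the describe output saved in step 1. A minimal sketch: pod_images is a hypothetical helper, and the docker pull loop in the usage comment assumes Docker is the container runtime on the node.

```shell
# Hypothetical helper: list the distinct images named in saved
# `kubectl describe pod` output. "Image ID:" lines are skipped because
# their first field is "Image", not "Image:".
pod_images() {
  awk '$1 == "Image:" { print $2 }' "${1:-/tmp/runbooks_describe_pod.txt}" | sort -u
}

# Usage on the node (assumption: Docker is the runtime):
#   for image in $(pod_images); do docker pull "$image"; done
```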

8) Check component statuses

Examine the output of your /tmp/runbooks_componentstatuses.txt file, looking for unhealthy components.

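This is most commonly a problem when a cluster has just been stood up. Note that componentstatuses is deprecated in Kubernetes v1.19+, so its output may be unreliable on newer clusters.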
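
A quick way to scan the saved file is to print only the components whose STATUS is not Healthy. This sketch assumes the usual NAME, STATUS, MESSAGE, ERROR column layout; unhealthy_components is a hypothetical helper name.

```shell
# Hypothetical helper: print the names of components whose STATUS column
# is anything other than Healthy.
unhealthy_components() {
  awk 'NR > 1 && $2 != "Healthy" { print $1 }' \
    "${1:-/tmp/runbooks_componentstatuses.txt}"
}

# Usage:
#   unhealthy_components
```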

Solutions List

A) Restart kubelet

B) Allocate resources

C) Repair your CNI

Solutions Detail

A) Restart kubelet

How exactly to restart the kubelet depends on its process supervisor. The most common one is systemd, which is controlled with systemctl:

systemctl restart [SERVICE_NAME]

If you don’t know how to restart the kubelet, you may need to contact your system administrator.

B) Allocate resources

Determine whether you need to increase the resources available, or limit resources your pod requests so as not to breach the limits. Which is appropriate depends on your particular circumstances. See the “0 nodes available” runbook for further guidance.

C) Repair your CNI

These links may help you resolve this problem:

Install the CNI provider

More rarely, this has been suggested as a solution:

Check if docker and kubernetes are using the same cgroup driver. I faced the same issue (CentOS 7, kubernetes v1.14.1), and setting same cgroup driver (systemd) fixed it. Source

Check Resolution

If the pod starts up with status Running according to the output of kubectl get pods, then the issue has been resolved.

If there is a different status, then it may be that this particular issue is resolved, but a new issue has been revealed.
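
The status check can be scripted as a small test over the kubectl get pods output. A minimal sketch, assuming the default column layout; pod_is_running is a hypothetical helper name.

```shell
# Hypothetical helper: exit 0 if the named pod's STATUS column reads Running,
# otherwise exit 1. Reads default `kubectl get pods` output on stdin.
pod_is_running() {
  awk -v pod="$1" '$1 == pod && $3 == "Running" { ok = 1 } END { exit !ok }'
}

# Usage, assuming kubectl is configured for your cluster:
#   kubectl get pods -n [NAMESPACE] | pod_is_running [POD_NAME] && echo resolved
```

If you would rather block until the pod is ready, kubectl wait --for=condition=Ready pod/[POD_NAME] -n [NAMESPACE] --timeout=120s is an alternative; the timeout value here is only an example.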

If it has not been resolved by this runbook, then please comment below.

Further Steps

None

Further Information

See here for background information on pod placement.

Kubelet logs

Owner

Ian Miell
