Troubleshooting in k8s
#1 Resource Exhausted error
The first step is to check whether the Kubernetes system components themselves are running correctly. In most cases you will find the error here. After fixing it, reload the daemon configuration and restart the kubelet with the commands below:
systemctl daemon-reload
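On a systemd-based node, the daemon reload is typically followed by a restart of the kubelet service; you can then check that the control-plane pods in the kube-system namespace are healthy:
systemctl restart kubelet
kubectl get pods -n kube-system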
If you still cannot find the root cause, check the following:
Make sure your node has enough disk space and memory; check the /var directory space especially.
Commands to check: df -kh, free -m
Verify CPU utilization with the top command and make sure no process is consuming an unexpected amount of CPU or memory.
Monitor Resource Usage: Use metrics-server or tools like Prometheus and Grafana to
monitor CPU and memory usage across the cluster.
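If metrics-server is installed, a quick way to see current usage is:
kubectl top nodes
kubectl top pods -A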
By identifying the root cause, ensuring proper resource configuration, and scaling your cluster
appropriately, you can resolve and prevent "Resource Exhausted" errors. Regular monitoring and
proactive resource management are essential to maintaining a stable Kubernetes environment.
#2 CrashLoopBackOff error
When a Kubernetes container repeatedly fails to start, it enters a ‘CrashLoopBackOff’ state,
indicating a restart loop within a pod. This error often occurs due to various issues preventing the
container from launching properly.
When a pod’s status displays CrashLoopBackOff, it means that Kubernetes is currently waiting the indicated backoff delay before restarting the pod again.
Every time the pod is restarted, Kubernetes waits for a longer and longer time, known as a backoff
delay. The delay between restarts is exponential (10s, 20s, 40s, …) and is capped at five minutes.
During this process, Kubernetes displays the CrashLoopBackOff error.
Causes of CrashLoopBackOff
CrashLoopBackOff occurs when a pod fails to start because a container inside it fails to start up properly and repeatedly crashes. Let’s review the common causes of this issue.
When the node that the pod is running on doesn’t have enough resources, the pod can be evicted
and moved to a different node. If none of the nodes have sufficient resources, the pod can go into
a CrashLoopBackOff state.
To resolve this issue, you need to understand the resource usage of your application and set the
appropriate resource requests and limits. You can use the kubectl describe pod [pod_name]
command to check if the pod was evicted due to insufficient memory.
You can also monitor the memory and CPU usage of your pods using Kubernetes metrics server or
other monitoring tools like Prometheus. If your application is consistently using more resources
than allocated, you might need to optimize your application, allocate more resources, or change
resources:limits in the Container’s resource manifest.
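As a minimal sketch, requests and limits are set per container in the pod spec; the values below are purely illustrative:
resources:
  requests:
    memory: "256Mi"
    cpu: "250m"
  limits:
    memory: "512Mi"
    cpu: "500m"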
A best practice for fixing this error is ensuring you have the latest Docker version and the most
stable versions of other plugins. Thus, you can prevent deprecated commands and inconsistencies
that trip your containers into start-fail loops.
When migrating a project into a Kubernetes cluster, you might need to roll back several Docker
versions to meet the incoming project’s version.
Check the syslog and other container logs to see if this was caused by any of the issues we mentioned as causes of CrashLoopBackOff (e.g., locked or missing files). If not, then the problem could be with one of the third-party services.
To verify this, you’ll need to use a debugging container. A debug container works as a shell that can be used to log in to the failing container. This works because both containers share a similar environment, so their behaviors are the same. Here is a link to one such shell you can use: ubuntu-network-troubleshooting.
Using the shell, log into your failing container and begin debugging as you normally would. Start
with checking kube-dns configurations, since a lot of third-party issues start with incorrect DNS
settings.
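If your cluster supports ephemeral containers, kubectl debug is one way to attach such a shell to the failing pod; the image, pod, and container names below are placeholders:
kubectl debug -it <pod-name> --image=busybox --target=<container-name>
# then, from inside the debug shell, confirm DNS resolution:
nslookup kubernetes.default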
Missing Dependencies
The CrashLoopBackOff status can activate when Kubernetes cannot locate runtime dependencies (i.e., the /var/run/secrets/kubernetes.io/serviceaccount files are missing). This might occur when some containers inside the pod attempt to interact with an API without the default access token.
This scenario is possible if you manually create the pods using a unique API token to access cluster
services. The missing service account file is the declaration of tokens needed to pass
authentication.
You can fix this error by allowing all new --mount creations to adhere to the default access level throughout the pod space. Ensure that new pods using custom tokens comply with this access level to prevent continuous startup failures.
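A minimal sketch of the relevant pod spec fields, assuming the pod should simply use the default service account token mount:
spec:
  serviceAccountName: default
  automountServiceAccountToken: true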
Suppose you have a shared master setup and run an update that restarts all the pod services. The
result is several restart loops because Kubernetes must choose a master from the available options.
You can fix this by changing the update procedure from a direct, all-encompassing one to a
sequential one (i.e., applying changes separately in each pod). This approach makes it easier to
troubleshoot the cause of the restart loop.
In some cases, CrashLoopBackOff can occur as a settling phase to the changes you make. The error
resolves itself when the nodes eventually receive the right resources for a stable environment.
If the error does not resolve itself, identify the conflicting service. Using netstat, look for the process that corresponds to that service and stop it with the kill command. Then delete the kube-controller-manager pod and restart it.
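For example, on the affected node (the port, PID, and node name are placeholders; the pod name assumes a kubeadm-style static pod, and netstat requires the net-tools package):
netstat -tulpn | grep <port>
kill <pid>
kubectl delete pod kube-controller-manager-<node-name> -n kube-system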
Troubleshoot CrashLoopBackOff
The best way to identify the root cause of the error is to start going through the list of potential
causes and eliminate them one by one, starting with the most common ones first.
☑If you get a Liveness probe failed and Back-off restarting failed container messages from the
kubelet, as shown below, this indicates the container is not responding and is in the process of
restarting.
From      Message
----      -------
kubelet   Liveness probe failed: cat: can’t open ‘/tmp/healthy’: No such file or directory
☑If you get the back-off restarting failed container message, this means that you are dealing with a temporary resource overload as a result of an activity spike. The solution is to adjust periodSeconds or timeoutSeconds to give the application a longer window of time to respond.
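For example, the liveness probe from the message above could be relaxed by increasing these fields (the values are illustrative):
livenessProbe:
  exec:
    command: ["cat", "/tmp/healthy"]
  initialDelaySeconds: 10
  periodSeconds: 30
  timeoutSeconds: 5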
You originally ran kubectl get pods to identify the Kubernetes pod that was exhibiting the
CrashLoopBackOff error. You can run the following command to get the last 5 log lines from the
pod:
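The pod name and namespace are placeholders; --previous returns the logs of the last terminated instance of the container:
kubectl logs <pod-name> -n <namespace> --previous --tail=5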
Search the log for clues showing why the pod is repeatedly crashing. If you cannot resolve the issue,
proceed to the next step.
Checking the logs of the owning Deployment, as shown below, may also provide clues about issues at the application level.
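The deployment name and namespace below are placeholders:
kubectl logs deployment/<deployment-name> -n <namespace>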
Failing all the above, the next step is to bash into the CrashLoop container to see exactly what
happened.
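For example, assuming the container image ships a shell:
kubectl exec -it <pod-name> -n <namespace> -- /bin/sh
If the container crashes too quickly to exec into, use the debugging-container approach described earlier instead.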
However, it is important to identify the root cause of the CrashLoopBackOff error and prevent it in
the first place.
Here are some best practices that can help you prevent the CrashLoopBackOff error.
1. Configure and Recheck Your Files
A misconfigured or missing configuration file can cause the CrashLoopBackOff error, preventing the
container from starting correctly. Before deployment, make sure all files are in place and configured
correctly.
In most cases, files are stored in /var/lib/docker. You can use commands like ls and find to verify if
the target file exists.
2. Check Third-Party Services
If the application depends on third-party services, you can log into the container and manually reach their endpoints using curl to check that they respond.
3. Check Your Environment Variables
Use env to inspect the environment variables and make sure they’re correct.
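For example (the pod name and namespace are placeholders):
kubectl exec <pod-name> -n <namespace> -- env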
4. Check Kube-DNS
The application may be trying to connect to an external service, but the kube-dns service is not
running.
You just need to restart the kube-dns service so the container can connect to the external service.
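In clusters where DNS is served by CoreDNS behind the kube-dns service (the default in recent Kubernetes versions), one way to restart it is:
kubectl -n kube-system rollout restart deployment coredns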
#3 OOMKilled error
OOMKilled (Out of Memory Killed) occurs in Kubernetes when a container exceeds the memory limit
specified in its resource configuration. The kubelet and the container runtime enforce these limits
through the kernel's cgroup mechanism and monitor container resource usage.
When a container uses more memory than its allocated limit, the kernel's Out of Memory (OOM)
Killer terminates the container to protect the system from running out of memory.
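You can confirm that a container was OOMKilled by inspecting its last terminated state (the pod name and namespace are placeholders):
kubectl describe pod <pod-name> -n <namespace>
kubectl get pod <pod-name> -n <namespace> -o jsonpath='{.status.containerStatuses[0].lastState.terminated.reason}'
The reason field reads OOMKilled when the kernel terminated the container for exceeding its memory limit.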
Possible Reasons Behind OOMKilled
1. Insufficient Memory Limits
The memory limit specified in the container's resource configuration is too low for its
actual workload.
2. Memory Leak in Application
The application running inside the container has a memory leak, causing it to consume
increasing amounts of memory over time.
3. Unexpected High Workload
A sudden surge in traffic or workload might cause the application to use more memory
than anticipated.
4. Improper Resource Allocation
Containers are deployed without specifying resource limits, leading to unbounded
memory usage and competition for system resources.
5. Misconfigured Applications
Applications are configured to use more memory than what the container is allowed.
6. Multiple Containers on the Same Node
If multiple containers are running on the same node, one container consuming excessive
memory can lead to eviction of others.
7. Node Resource Exhaustion
The node itself may not have enough memory to handle all the containers running on it.
8. Unoptimized Code or Queries
Poorly optimized application code or inefficient database queries could lead to excessive
memory usage.
#4 Node Not Ready error
The Node Not Ready status in Kubernetes indicates that a node in your cluster is not functioning properly and cannot host any pods.
Nodes may enter a Not Ready state for a variety of reasons, ranging from network issues and resource exhaustion to misconfigurations or underlying hardware problems. Understanding and resolving the root cause of this error is essential to maintaining the operational efficiency and reliability of a Kubernetes cluster.
In Kubernetes, Nodes can be in one of several states, reflecting their current status and ability to
accept workloads:
A node in the Ready state is healthy and capable of accepting new pods.
A node in the Not Ready state has encountered an issue that prevents it from functioning
correctly.
The Unknown state indicates that the Kubernetes master has lost communication with
the node, and its status cannot be determined.
To determine if a node is experiencing a Node Not Ready error, and obtain the information
necessary to solve the problem, follow these steps:
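First, list the nodes to see which one reports Not Ready, then describe that node (the node name is a placeholder):
kubectl get nodes
kubectl describe node <node-name>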
The kubectl describe node command provides comprehensive details about the node, including its conditions, events, and configuration. This information is useful for diagnosing the root cause of the Not Ready status, offering insights into any errors or warnings that the node might be experiencing. Analyzing this output helps pinpoint specific issues, guiding the troubleshooting and resolution process.
Here are a few things to notice in the output, which could indicate the cause of the problem: the node's Conditions (such as MemoryPressure, DiskPressure, PIDPressure, and Ready), recent Events, and the Allocated resources section.
You can access kubelet logs using journalctl or other logging utilities, depending on the node’s
operating system:
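On systemd-based nodes, for example:
journalctl -u kubelet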
Possible Causes
1. Node Issues:
The kubelet service is not running.
Node has insufficient resources (CPU, memory, or disk).
Node has networking or connectivity issues.
Node is under maintenance or powered off.
2. API Server Issues:
The node cannot communicate with the Kubernetes API server.
3. Component Issues:
Missing or misconfigured critical components such as kubelet, container runtime (e.g.,
Docker, containerd), or kube-proxy.
Misconfigured CNI (Container Network Interface) plugins.
4. Configuration Errors:
Outdated kubelet certificates or misconfigured kubelet on the node.
Issues with taints and tolerations.
5. Cloud-Specific Problems:
Cloud provider (e.g., AWS, GCP, Azure) misconfigurations.
Node not attached to the cluster due to IAM or role issues.
Scarcity of Resources
One common cause of the Node Not Ready error is the scarcity of resources, such as CPU or
memory exhaustion. Monitoring resource usage can help identify if this is the cause. If the node is
over-allocated, consider scaling down workloads or adding more nodes to the cluster.
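If metrics-server is installed, node-level usage can be checked directly (the node name is a placeholder):
kubectl top node <node-name>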
kubelet Process
Restarting the kubelet might resolve some issues in the kubelet process. The command to restart
the kubelet varies depending on the system manager in use.
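On systemd-based nodes, for example:
sudo systemctl restart kubelet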
This command restarts the kubelet service, potentially resolving issues that prevent the node from
reaching a Ready state.
kube-proxy
Issues with kube-proxy, the network proxy running on each node, can also affect node readiness.
Checking the status of kube-proxy and restarting it if necessary can help:
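Assuming kube-proxy runs as a DaemonSet in the kube-system namespace (the default in most clusters):
kubectl get pods -n kube-system -l k8s-app=kube-proxy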
This command checks the status of the kube-proxy pods. If they are not running as expected, kube-proxy can be restarted with:
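Assuming the default DaemonSet name:
kubectl -n kube-system rollout restart daemonset kube-proxy
Alternatively, delete the affected kube-proxy pod and let the DaemonSet recreate it.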
Restarting kube-proxy can resolve network-related issues affecting the node’s ability to
communicate with the cluster, potentially resolving the Not Ready error.
Preventive Measures
There are some preventive measures we can take to safeguard the cluster from falling into this error.
#5 Gateway Timeout error
A Gateway Timeout error in a Kubernetes environment usually means that a service or application within the cluster is taking too long to respond, or is unreachable. Here’s a step-by-step guide to help you troubleshoot and resolve this issue, along with the possible causes.
Possible Reasons
1. Application Issues
The backend service is slow or unresponsive due to high CPU or memory utilization.
Long-running queries or insufficient resources are causing delays.
2. Networking Issues
Misconfigured network policies or ingress rules.
A misbehaving load balancer or DNS resolution issues.
3. Service Configuration Issues
Timeout settings on the Ingress Controller, Service, or Load Balancer are too short.
Service or pod selectors in Kubernetes are misconfigured, leading to no backend pods being targeted (see the endpoint check after this list).
4. Pod Issues
Pods are in a CrashLoopBackOff, Pending, or Terminating state.
Horizontal Pod Autoscaler (HPA) is not scaling correctly to handle the traffic.
5. Ingress or Load Balancer Issues
Misconfigured ingress annotations.
Load balancer health checks are failing due to incorrect paths or ports.
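One quick way to confirm that a Service actually targets backend pods (see reason 3 above) is to check its endpoints; an empty endpoint list usually means the selector does not match any pod labels. The names below are placeholders:
kubectl get endpoints <service-name> -n <namespace>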
Steps to Resolve
1. Check Application Health
Verify if the backend application is responsive.
Check the logs of the application pods using:
kubectl logs <pod-name> -n <namespace>
Use kubectl exec to probe application readiness manually:
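For example, assuming the application exposes an HTTP health endpoint and the image includes curl (the pod name, port, and path are placeholders):
kubectl exec -it <pod-name> -n <namespace> -- curl -v http://localhost:<port>/<health-path>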
If using a Load Balancer, confirm the health check settings are correct.
Update the timeout settings in the ingress controller or load balancer annotations. For NGINX
ingress:
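A sketch of the relevant ingress-nginx annotations; the timeout values (in seconds) are illustrative:
metadata:
  annotations:
    nginx.ingress.kubernetes.io/proxy-connect-timeout: "60"
    nginx.ingress.kubernetes.io/proxy-send-timeout: "60"
    nginx.ingress.kubernetes.io/proxy-read-timeout: "60"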
7. Debug Networking
Test connectivity from the ingress controller to the pods:
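For example, from inside the ingress controller pod (names, IPs, and ports are placeholders; this assumes curl is available in the controller image):
kubectl exec -it <ingress-controller-pod> -n <ingress-namespace> -- curl -v http://<pod-ip>:<port>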
8. Autoscaling
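For example, you can check whether the Horizontal Pod Autoscaler mentioned earlier is present and scaling as expected (the names are placeholders):
kubectl get hpa -n <namespace>
kubectl describe hpa <hpa-name> -n <namespace>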