Patrick Jones IT Fun 07.20.2021

The Most Common Cluster Failures in Kubernetes and How to Troubleshoot Them

When running software applications, we need to maintain and monitor both the application and its underlying infrastructure. In containerized environments, the container orchestration platform is one of the most important pieces to keep healthy in order to avoid cluster issues. If a cluster failure occurs in a Kubernetes environment, we need to identify the root cause and fix it. This process is called Kubernetes troubleshooting.

Troubleshooting Kubernetes cluster failures can be a complex, costly, and tedious task. Even though Kubernetes has some built-in monitoring and troubleshooting tools, they lack the functionality and ease of use needed for large deployments. The first place a Kubernetes admin should look when troubleshooting cluster issues is the logs. Logs from both the master (control plane) nodes and the worker nodes usually narrow down where the failure originated.

In this article, we'll have a look at the most common cluster issues that can occur when dealing with the Kubernetes orchestration platform, so that you know where to look when troubleshooting cluster failures in Kubernetes.

Network connectivity issues

The most common issue is network connectivity. It can affect both internal communication inside the cluster and external communication with services outside the Kubernetes cluster. The primary causes are misconfigurations such as incorrectly exposed services, incorrect DNS entries, and incorrect network model or plugin configurations. Even simple mistakes, like an incorrect selector in the YAML file or a port that is not exposed in the container, can cause connectivity issues.
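To make the selector and port mistakes concrete, here is a minimal sketch of the two checks. It works on plain dictionaries shaped like Service and Pod manifests; the helper names (selector_matches, undeclared_target_ports) are hypothetical, and the port check only handles numeric ports. Keep in mind that a containerPort entry is largely informational in Kubernetes, so a missing one is a hint to investigate rather than proof of a fault.

```python
def selector_matches(selector: dict, pod_labels: dict) -> bool:
    """True if every key/value pair of the Service selector is in the Pod labels."""
    # An empty selector is treated as "no match" to keep the sketch simple.
    return bool(selector) and all(
        pod_labels.get(key) == value for key, value in selector.items()
    )


def undeclared_target_ports(service: dict, pod: dict) -> list:
    """Return Service target ports that no container in the Pod declares."""
    declared = {
        port["containerPort"]
        for container in pod["spec"]["containers"]
        for port in container.get("ports", [])
    }
    targets = [p.get("targetPort", p["port"]) for p in service["spec"]["ports"]]
    return [target for target in targets if target not in declared]


# Illustrative manifests (made up for this example).
service = {
    "spec": {
        "selector": {"app": "web"},
        "ports": [{"port": 80, "targetPort": 8080}],
    }
}
pod = {
    "metadata": {"labels": {"app": "web", "tier": "frontend"}},
    "spec": {"containers": [{"name": "web", "ports": [{"containerPort": 80}]}]},
}

print(selector_matches(service["spec"]["selector"], pod["metadata"]["labels"]))
print(undeclared_target_ports(service, pod))
```

Here the selector matches, but the Service forwards to port 8080 while the container only declares port 80, which is exactly the kind of mismatch that is easy to miss in a long YAML file.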

Networking becomes even more complex with microservices-based architectures, as a single connectivity issue can impact the whole application. The application may use different workload types such as ReplicaSet, StatefulSet, or DaemonSet, and the underlying network configuration must match each of them. For instance, in a StatefulSet each Pod gets a unique, stable network identifier (Stable Network ID), which relies on a headless Service being configured for the StatefulSet. Likewise, you always have to match the network configuration to the application or workload type.
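As an illustration of the stable network ID, the sketch below builds the DNS names that the Pods of a StatefulSet are expected to have. The function name and the example values ("web", "nginx") are hypothetical, and the sketch assumes the default cluster domain, cluster.local, and a headless Service governing the StatefulSet.

```python
def statefulset_pod_dns(
    name: str,
    service: str,
    namespace: str = "default",
    replicas: int = 3,
    cluster_domain: str = "cluster.local",  # assumption: default cluster domain
) -> list:
    """Return the expected per-Pod DNS names for a StatefulSet.

    Pods are named <statefulset>-<ordinal>, and each one is reachable at
    <pod>.<headless-service>.<namespace>.svc.<cluster-domain>.
    """
    return [
        f"{name}-{ordinal}.{service}.{namespace}.svc.{cluster_domain}"
        for ordinal in range(replicas)
    ]


for hostname in statefulset_pod_dns("web", "nginx", replicas=2):
    print(hostname)
```

If one of these names does not resolve from inside the cluster, the headless Service or the cluster DNS is a good place to start looking.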

Another source of connectivity issues is external components like firewalls. Load balancers or firewalls are usually configured outside the Kubernetes cluster to manage traffic to the cluster or to provide an additional layer of protection. You might waste precious time diagnosing the cluster network, only to discover that the cause is a single firewall rule.
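One quick way to narrow this down is to test plain TCP reachability from both sides of the firewall. The helper below is a minimal sketch using only the Python standard library. The function name is hypothetical, and a successful connection only shows that the port is reachable, not that the service behind it is healthy.

```python
import socket


def tcp_reachable(host: str, port: int, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port can be opened in time."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and DNS resolution failures.
        return False
```

Running the same check from a machine outside the cluster and from a Pod inside it tells you which side of the firewall or load balancer the traffic is being dropped on.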

Data and storage failures

Data loss and unavailability also contribute to cluster failures. These issues are more prevalent with persistent volumes whose volume types (plugins) point to third-party services like AWS Elastic Block Store (EBS) or Azure Files, as Kubernetes depends on those services to function properly. They also lead back to networking, since communication errors with the storage location directly impact the cluster.

With the "Storage Object in Use Protection" feature, a PersistentVolumeClaim (PVC) cannot be removed while an active Pod is using it, as this may result in data loss. However, issues arise when a PVC is deleted and the underlying PersistentVolume (PV) cannot be reclaimed. The PV's reclaim policy decides what happens next, and there are two options:

Retain: the PV and the data stored in the external storage are kept after the claim is deleted, so the volume can be reclaimed manually.
Delete: both the PV and the associated external storage are deleted.

A user might mistakenly use the Delete policy, which can cause catastrophic data loss and application failure, with the only recovery option being a backup or a snapshot of the data.
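A simple safeguard is to list the volumes that use the Delete policy before anyone removes a claim. The sketch below assumes a dictionary shaped like the output of kubectl get pv -o json. The helper name and the sample volume names are hypothetical.

```python
def pvs_with_delete_policy(pv_list: dict) -> list:
    """Return the names of PersistentVolumes whose reclaim policy is Delete."""
    return [
        pv["metadata"]["name"]
        for pv in pv_list.get("items", [])
        if pv["spec"].get("persistentVolumeReclaimPolicy") == "Delete"
    ]


# Illustrative data in the shape of `kubectl get pv -o json`.
sample = {
    "items": [
        {"metadata": {"name": "pv-logs"},
         "spec": {"persistentVolumeReclaimPolicy": "Delete"}},
        {"metadata": {"name": "pv-database"},
         "spec": {"persistentVolumeReclaimPolicy": "Retain"}},
    ]
}

print(pvs_with_delete_policy(sample))
```

Any volume on that list that holds data you cannot afford to lose is a candidate for switching to Retain or for a backup schedule.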

Dynamic provisioning lets Kubernetes create PVs on demand and bind them to claims. A volume claim will remain unbound (Pending) if no matching volume can be found or provisioned, leading to cluster and application issues since there is no actual storage location for the data. This is a common scenario with under-provisioned persistent volumes, where the available volume is smaller than the requested amount.
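The size mismatch is easy to reproduce on paper. The following sketch compares a PV with a PVC on capacity, storage class, and access modes. It is a deliberate simplification with hypothetical helper names: the real Kubernetes binding logic also considers things like label selectors, volume mode, and node affinity, and the quantity parser only understands a handful of suffixes.

```python
# Subset of Kubernetes quantity suffixes, enough for this sketch.
_FACTORS = {
    "Ki": 2**10, "Mi": 2**20, "Gi": 2**30, "Ti": 2**40,
    "k": 10**3, "M": 10**6, "G": 10**9, "T": 10**12,
}


def to_bytes(quantity: str) -> int:
    """Convert a storage quantity such as '10Gi' or '500M' to bytes."""
    for suffix, factor in _FACTORS.items():
        if quantity.endswith(suffix):
            return int(float(quantity[: -len(suffix)]) * factor)
    return int(quantity)  # plain number of bytes


def can_satisfy(pv: dict, pvc: dict) -> bool:
    """Simplified check of whether a PV could satisfy a PVC."""
    capacity = to_bytes(pv["spec"]["capacity"]["storage"])
    request = to_bytes(pvc["spec"]["resources"]["requests"]["storage"])
    same_class = (
        pv["spec"].get("storageClassName", "")
        == pvc["spec"].get("storageClassName", "")
    )
    modes_ok = set(pvc["spec"]["accessModes"]) <= set(pv["spec"]["accessModes"])
    return capacity >= request and same_class and modes_ok


small_pv = {"spec": {"capacity": {"storage": "5Gi"},
                     "accessModes": ["ReadWriteOnce"],
                     "storageClassName": "standard"}}
claim = {"spec": {"resources": {"requests": {"storage": "10Gi"}},
                  "accessModes": ["ReadWriteOnce"],
                  "storageClassName": "standard"}}

print(can_satisfy(small_pv, claim))
```

A 5Gi volume cannot satisfy a 10Gi claim, so in this example the claim would stay unbound until a large enough volume exists.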

Infrastructure configuration errors

Another reason for cluster issues is configuration errors in the Kubernetes cluster itself. Any issue in the control plane (kube-apiserver, etcd, kube-scheduler, etc.) or in a node-level service such as kube-proxy can cause cluster failures. Therefore, users need to configure the Kubernetes cluster properly before deploying containers. These issues mostly affect custom Kubernetes deployments, since managed Kubernetes services such as Amazon EKS and Azure Kubernetes Service (AKS) take care of the core services.
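On self-managed clusters, the core components often run as Pods in the kube-system namespace, so a first check is whether all of them report Ready. The sketch below assumes a dictionary shaped like the output of kubectl get pods -n kube-system -o json. The helper name and sample Pod names are hypothetical, and Pods that have run to completion will also show up as not ready.

```python
def not_ready_pods(pod_list: dict) -> list:
    """Return the names of Pods that do not have a Ready=True condition."""
    unhealthy = []
    for pod in pod_list.get("items", []):
        conditions = pod.get("status", {}).get("conditions", [])
        ready = any(
            c.get("type") == "Ready" and c.get("status") == "True"
            for c in conditions
        )
        if not ready:
            unhealthy.append(pod["metadata"]["name"])
    return unhealthy


# Illustrative data in the shape of `kubectl get pods -n kube-system -o json`.
kube_system = {
    "items": [
        {"metadata": {"name": "etcd-node1"},
         "status": {"conditions": [{"type": "Ready", "status": "True"}]}},
        {"metadata": {"name": "kube-proxy-abcde"},
         "status": {"conditions": [{"type": "Ready", "status": "False"}]}},
    ]
}

print(not_ready_pods(kube_system))
```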

Pod and image issues

The next common factor is Pod-related issues. Insufficient resources (CPU or memory) or issues with the network or volume provider can leave Pods stuck in the Pending state, and a container that keeps failing after it starts puts its Pod into CrashLoopBackOff. The other factor is the container image: if Kubernetes is unable to pull the required image, the Pod reports ImagePullBackOff or ErrImagePull errors. All of these contribute to cluster failures and errors.
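These states can be told apart from the Pod status alone. The sketch below assumes a dictionary shaped like a single Pod from kubectl get pod -o json and maps the states named above to a first troubleshooting step. The function name and the advice strings are illustrative, not an official classification.

```python
IMAGE_REASONS = {"ImagePullBackOff", "ErrImagePull"}


def diagnose_pod(pod: dict) -> str:
    """Map a Pod status to a short troubleshooting hint."""
    status = pod.get("status", {})
    # Check container waiting reasons first: a Pod that cannot pull its
    # image is also in the Pending phase.
    for container in status.get("containerStatuses", []):
        reason = container.get("state", {}).get("waiting", {}).get("reason")
        if reason in IMAGE_REASONS:
            return f"image problem ({reason}): check image name, tag, and registry access"
        if reason == "CrashLoopBackOff":
            return "container keeps crashing: check the container logs"
    if status.get("phase") == "Pending":
        return "pending: check resource requests, volumes, and scheduling events"
    return "no known failure pattern"


crashing = {"status": {"phase": "Running", "containerStatuses": [
    {"state": {"waiting": {"reason": "CrashLoopBackOff"}}}]}}
print(diagnose_pod(crashing))
```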

Authentication problems

kubectl will not be able to identify or access resources without proper authentication credentials. This can happen when the Kubernetes authentication credentials (/etc/kubernetes/admin.conf) are not properly set up or when Role-Based Access Control (RBAC) is not configured. Therefore, we need to enable RBAC and create and assign RBAC policies so that every user has the access they need to the resources in the Kubernetes cluster.
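To show how an RBAC policy grants or denies a request, here is a simplified sketch that evaluates the rules of a Role given as a dictionary. The helper names and the sample Role are hypothetical, and the sketch ignores resourceNames, subresources, and non-resource URLs. On a live cluster, kubectl auth can-i answers the same question authoritatively.

```python
def _covers(allowed: list, value: str) -> bool:
    """A rule field covers a value if it lists it or contains the '*' wildcard."""
    return "*" in allowed or value in allowed


def role_allows(role: dict, api_group: str, resource: str, verb: str) -> bool:
    """Return True if any rule in the Role permits the verb on the resource."""
    return any(
        _covers(rule.get("apiGroups", []), api_group)
        and _covers(rule.get("resources", []), resource)
        and _covers(rule.get("verbs", []), verb)
        for rule in role.get("rules", [])
    )


# Illustrative Role: read-only access to Pods in the core ("") API group.
pod_reader = {
    "rules": [
        {"apiGroups": [""], "resources": ["pods"], "verbs": ["get", "list", "watch"]}
    ]
}

print(role_allows(pod_reader, "", "pods", "list"))
print(role_allows(pod_reader, "", "pods", "delete"))
```

A user bound only to this Role can read Pods but gets a Forbidden error on delete, which is often mistaken for a broken cluster when it is really a missing permission.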

Application errors

Errors in the application or its containers can also lead to Kubernetes cluster failures. For example, a vulnerability in the application can expose the whole cluster to malicious attacks. Moreover, if the cluster does not have enough resources available, Pods can become unresponsive. This also affects the scaling and availability of the application and ultimately degrades the performance of the cluster.
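A rough way to spot resource pressure is to compare the CPU that Pods request with what a node can allocate. The sketch below uses hypothetical helper names and made-up numbers, and it only handles plain core counts and the millicore ("m") suffix.

```python
def cpu_millicores(quantity: str) -> int:
    """Convert a CPU quantity such as '500m' or '2' to millicores."""
    if quantity.endswith("m"):
        return int(quantity[:-1])
    return round(float(quantity) * 1000)


def cpu_headroom(allocatable: str, requests: list) -> int:
    """Return allocatable CPU minus the sum of requests, in millicores."""
    return cpu_millicores(allocatable) - sum(cpu_millicores(r) for r in requests)


# Illustrative node with 2 cores and three Pods requesting CPU.
print(cpu_headroom("2", ["500m", "1", "750m"]))
```

A negative result means the Pods ask for more CPU than the node can offer, so at least one of them cannot be scheduled there.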