
NetBackup™ for Kubernetes Administrator's Guide

Last Published: 2024-09-30

Product(s): NetBackup & Alta Data Protection (10.5)

NetBackup Kubernetes operator becomes unresponsive if the PID limit is exceeded on the Kubernetes node

In Linux systems, an initd or systemd process runs as PID 1 and reaps zombie processes. Containers that do not have such an init process accumulate zombie processes, because nothing reaps their terminated child processes.

Over time, these zombie processes accumulate until they reach the maximum number of PIDs allowed on the Kubernetes node.

In the NetBackup Kubernetes operator, nbcertcmdtool spawns child processes to carry out certificate-related operations. When an operation completes, these processes become orphaned and are not reaped. Eventually the node reaches its maximum PID limit and the NetBackup Kubernetes operator becomes unresponsive.
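
As a quick check for this condition, the following sketch counts defunct entries in the process table. The function name count_zombies is hypothetical and not part of NetBackup. It assumes a procps-style ps that accepts -eo stat=, and it is meant to run on the node or inside the container. The result can be compared with the kernel-wide ceiling in /proc/sys/kernel/pid_max, although the limit that actually applies on your node may be lower, depending on how the cluster is configured.

```shell
# Hypothetical helper (not part of NetBackup): counts zombie entries.
# Reads one process state per line on stdin, as printed by `ps -eo stat=`.
# A state that starts with "Z" marks a defunct (zombie) process.
count_zombies() {
    awk '$1 ~ /^Z/ { n++ } END { print n + 0 }'
}

# Example usage on a Linux host:
#   ps -eo stat= | count_zombies      # number of zombies right now
#   cat /proc/sys/kernel/pid_max      # kernel-wide PID ceiling
```

A count that keeps growing between runs suggests that terminated child processes are not being reaped.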

Error message: login pod/nbukops-controller-manager-67f5498bbb-gn9zw -c netbackupkops -n nbukops ERRO[0005] exec failed: container_linux.go:380: starting container process caused: read init-p: connection reset by peer a command that is terminated with exit code 1.

Recommended actions:

To fix the PID limit issue, use the Initd script. The Initd script acts as the parent process, or entry point script, of the controller pod.
As the parent process, it adopts orphaned child processes and reaps them when they exit, so that zombie processes do not persist. It also helps to shut down the container gracefully. The Initd script is available in NBUKOPs build version 10.0.1.
Use the following steps to remove the existing nbcertcmdtool zombie processes:

Describe the NetBackup operator pod to find the Kubernetes node on which the controller pod is running. Run the command:
kubectl describe pod <NB k8s operator pod name> -n <namespace>
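Log on to the Kubernetes node. Run the command, where <node name> is the node found in the previous step and <image> is a debugging image of your choice:
kubectl debug node/<node name> -it --image=<image>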
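Terminate the nbcertcmdtool zombie processes. Run the command:
ps -ef | grep "\[nbcertcmdtool\] <defunct>" | awk '{print $3}' | xargs kill -9
The command sends the kill signal to the parent process (the third column of the ps -ef output) of each defunct nbcertcmdtool entry, because a zombie process is removed only when its parent reaps it or exits.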
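
Before running the kill command, it can be useful to preview which processes it would signal. The sketch below applies the same filter as the command above but only prints the unique parent PIDs (the third column of ps -ef) instead of killing them. The function name zombie_parents is hypothetical and not part of NetBackup.

```shell
# Hypothetical helper (not part of NetBackup): dry run of the cleanup command.
# Reads `ps -ef` output on stdin and prints each parent PID (column 3)
# of a defunct nbcertcmdtool entry once, without sending any signal.
zombie_parents() {
    grep '\[nbcertcmdtool\] <defunct>' | awk '{ print $3 }' | sort -u
}

# Example usage on the node:
#   ps -ef | zombie_parents
```

If the output is empty, no defunct nbcertcmdtool entries were found and there is nothing to terminate.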

Note:

These steps terminate all the zombie processes on that worker node, but they resolve the issue only temporarily. For a permanent solution, you must deploy a new KOps build that includes the Initd script.
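
This guide does not show the Initd script itself. As an illustration of what an init-style entry point does, the following minimal sketch starts the workload as a child process, forwards termination signals to it, and waits for it to exit. The function name run_as_init and the workload path are hypothetical. The sketch assumes that the shell runs as PID 1 in the container, so that orphaned processes are re-parented to it. Whether your shell then reaps them should be verified in your own image.

```shell
#!/bin/sh
# Hypothetical sketch, NOT the NetBackup Initd script.
# run_as_init starts the given command as a child, forwards TERM/INT to it
# so that it can shut down gracefully, and returns the status reported by
# wait for that child (normally the child's exit status).
run_as_init() {
    "$@" &
    child=$!
    trap 'kill -TERM "$child" 2>/dev/null' TERM INT

    wait "$child"
    status=$?
    # A trapped signal makes wait return early, so keep waiting until the
    # child has really exited.
    while kill -0 "$child" 2>/dev/null; do
        wait "$child"
        status=$?
    done

    trap - TERM INT
    return "$status"
}

# Example usage as a container entry point (the workload path is a placeholder):
#   run_as_init /path/to/workload
#   exit $?
```

Keeping the workload as a child rather than replacing the shell with exec is what leaves a parent process in place to forward signals and collect exit statuses.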