Veritas InfoScale™ for Kubernetes Environments 8.0.220 - Linux
- Overview
- System requirements
- Preparing to install InfoScale on Containers
- Installing Veritas InfoScale on OpenShift
- Installing InfoScale on a system with Internet connectivity
- Installing InfoScale in an air gapped system
- Installing Veritas InfoScale on Kubernetes
- Prerequisites
- Tagging the InfoScale images on Kubernetes
- Installing InfoScale on Kubernetes
- Configuring KMS-based Encryption on an OpenShift cluster
- Configuring KMS-based Encryption on a Kubernetes cluster
- InfoScale CSI deployment in Container environment
- Dynamic provisioning
- Snapshot provisioning (Creating volume snapshots)
- Managing InfoScale volume snapshots with Velero
- Volume cloning
- Installing and configuring InfoScale DR Manager on OpenShift
- Installing and configuring InfoScale DR Manager on Kubernetes
- Disaster Recovery scenarios
- Configuring InfoScale
- Administering InfoScale on Containers
- Upgrading InfoScale
- Troubleshooting
Known Issues
The following issues were observed during testing in an internal test environment. They remain unresolved, and you might encounter them. Each issue is described, and a workaround is provided if one exists.
Note:
A workaround is a temporary solution to the issue. Veritas is working toward fixing these issues.
Table: Issue description and workaround
Description | Workaround |
---|---|
On a setup with multiple Network Interface Cards, deployment of SRO might fail because the underlying NFD worker pods cannot reach the NFD master. (4045909) | Set the appropriate interfaces, or keep only the valid Network Interface Card interfaces online. |
Deleting InfoScale pods by using the oc delete pods or kubectl delete pods command might result in a configuration failure. (4045599) | To undeploy InfoScale, delete it by using the InfoScale CR procedure. |
On a VMware Virtual Machine, deployment of InfoScale on OpenShift or Kubernetes fails if Disk UUID is not enabled. (4046388) | Enable Disk UUID before deployment. See Enabling disk UUID on virtual machines. |
For space-optimized snapshots, if writes to the volume exceed the size of the cache object, a write error is observed. This causes the snapshot volume to be marked INVALID after the mirror is detached. (4043239) | Use space-optimized snapshots only where the expected rate of data change is much smaller than the actual data volume size. |
When a file system is 100% full and a PVC resize is attempted, allocating space for the metadata or the configuration files required for the file system resize might fail, causing the PVC resize to fail. (4045020) | Contact Veritas support for system recovery. |
Deployment of pods with a PVC that is restored from a snapshot or cloned from another PVC fails when it is initiated in ReadOnlyMany (ROM) access mode. The deployment does not come up. (4040975) | Set the following deployment parameters to True - and . |
Disk initialization performed by using the vxdisksetup command fails with the following error message - | Ensure that the disk does not belong to any other disk group. If it does not belong to any other disk group, the disk might have stale metadata. Run vxdiskunsetup on the disk and retry the disk initialization. |
A message 'File missing or empty: /boot/grub/menu.lst' is displayed even after a successful disk initialization. (4039351) | Ignore the message. |
When a majority of the nodes in a cluster go into a 'NotReady' state, fast failover (kube-fencing) panics those nodes and InfoScale fencing panics the remaining nodes. Even after the cluster comes back with a majority of the nodes in a 'Ready' state, unfinished kube-fencing jobs continue to panic nodes. (4044408) | Manually delete the kube-fencing jobs until the cluster is up. |
With a heavy workload, a node goes into a 'NotReady' state and InfoScale pods get killed. (4044963) | OpenShift Container Platform (OCP) runs extra system pods that consume memory. With heavy workloads, pods are killed to free memory. Try the following - |
When a back enclosure is disabled and enabled on a cvm-slave node, the disk fails to attach back to the disk group. (4046928) | Run /etc/vx/bin/vxreattach. |
After a slave node is faulted, one or more volumes do not get mounted, or existing volumes get unmounted inside the application pod. (4044533) | Reschedule or restart the pod to mount the volume. |
If a worker node is powered off or rebooted, the node goes into the emergency shell and enters the 'NotReady' state, becoming inaccessible. (4053892) | Reinstall or reconfigure the control plane on the worker node. See the OpenShift documentation. |
After creating a PVC in RWX mode, data written by an application pod running on one node is not accessible from an application pod scheduled on another node. (4046460) | See https://access.redhat.com/solutions/6153272 for the recommended solution. |
In the container form factor, if the public or private NICs used for LLT are bonded and the underlying bonded NICs are configured on the same switch, the worker nodes on which InfoScale is configured might panic randomly with the message kernel BUG at mm/slub.c:305!. (4048786) | If NIC bonding is required for the LLT links, configure the underlying NICs on different switches to avoid the kernel panic, even though the crash has no functional impact. If the private links are connected to the same switch, the bond mode must be Active-Backup. |
Disks from a node of the InfoScale cluster do not get added to the disk group - | Add these disks to the disk group manually from the |
During InfoScale deployment, the InfoScale configuration is not complete on one of the nodes and the node remains in a 'NotReady' state. (4047598) | Remove |
If InfoScale is undeployed on all nodes while retaining the disk groups, re-creating the InfoScale cluster on some nodes and adding nodes to the cluster fails. (4047205) | Undeploy and re-create the InfoScale cluster on the same number of nodes. To undeploy InfoScale on all nodes and re-create InfoScale clusters on some nodes, contact Veritas support. |
After restoring a space-optimized snapshot to a new PVC, mounting the restored PVC may fail if the source snapshot volume is detached. (4012858) | Try one of the following - |
The CSI controller pod remains in the 'Terminating' state after a graceful node shutdown or power-off. (4011482) | Try one of the following - |
CSI node pods do not get rescheduled on other worker nodes when their parent node is drained. (4011384) | None |
While restoring a snapshot, the PVC goes into a 'Pending' state after rebooting all nodes except the master node in the cluster. (4014525, 4048825) | Delete the snapshot volume by using the vxedit command. Kubernetes automatically reattempts to create the volume snapshot. |
In case of a storage failure, application I/Os to the mount point inside the container fail and the pod goes into the CreateContainerConfigError or Error state. (4011219, 4014758, 4015259) | Manually restart the application pod after the storage failure is resolved. |
The current application on the primary must not be deleted until it is clear that DR is possible. In some cases, DR fails and the application gets deleted on the primary. (4047475) | Ensure that the peer clusters are connected and Data Replication is in a healthy state. |
If all worker nodes go down at the same time, the InfoScale availability configuration is lost. (4050355) | After recovery, the InfoScale configuration is re-created. This might take up to 20 minutes. |
If applications on a cluster with a Load Balancer configuration are migrated, the Load Balancer service appears in a 'Pending' state if the target cluster's Load Balancer IP addresses are different. (4051429) | If you are using a Load Balancer service, use DNS custom resources to manage DNS endpoint mapping. |
The delete datarep operation becomes unresponsive and a force delete in the CR fails. (4050857) | Complete the following steps to clean up. See the Veritas Volume Replicator and Veritas Cluster Server documentation for details. |
DR migrate fails to complete, and running the command kubectl describe datareplication.infoscale.veritas.com/<datarep_name> returns the message vradmin migrate command failed. (4053632) | Complete the following steps so that tutil is cleared and the DR migration completes (see the sketch after this table). |
Kube-fencing is not functional when the REST service and the InfoScale operator pod are not running. (4054545) | None |
In a VVR environment, vradmin migrate might fail with the following error message: VxVM VVR vxrvg ERROR V-5-1-1617 giving up: utility fields must be cleared by executing: vxedit -f set tutil0="" <rvg> (4057713) | Run the command vxedit -f set tutil0="" <rvg> to clear the tutil0 field (see the sketch after this table). |
During cluster configuration or while adding new nodes on OpenShift or Kubernetes, a node join might fail if the disks in the cluster have old disk group records from previous deployments. Output similar to the following indicates old disk group records: vxvm:vxconfigd[238047]: V-5-1-11092 cleanup_client: (Disk in use by another cluster) 223 esxd05vm06 kernel: VxVM vxio V-5-0-164 Failed to join cluster infoscale_22670, aborting : (4057178) | Reset the cluster ID on the disks of the joiner node to allow the node to join. |
Stale InfoScale kernel modules might be left over after undeploying the InfoScale cluster on OpenShift or Kubernetes. (4042642) | Before deploying InfoScale on OpenShift or Kubernetes, check whether any stale InfoScale kernel modules ( |
LLT tools are unable to detect a duplicate InfoScale cluster ID on OpenShift. (4057800) | Redeploy the InfoScale CR to avoid a duplicate cluster ID. |
An RLINK detach is observed while replication autosync is in progress between the primary and secondary clusters. (VIKE-1290) | Run kubectl describe datarep <datareplicationName> to check the datareplication status. If |
Volume resize fails after migration if the storage class consumed by the application pod's PVC is not present on the secondary cluster, which is the new primary. (VIKE-1385) | Manually create a Storage Class (SC) on the new primary cluster that is the same as the one on the old primary cluster. |
On a Kubernetes cluster, TCP traffic reaches the node but does not get forwarded to the pod. (VIKE-1108) | Disable tx and rx offload on all nodes for the Calico tunnel device (see the sketch after this table). |
Migration or takeover fails if the namespace being backed up already exists on the target cluster with different SCC (Security Context Constraints)-related annotations, such as uid-range, supplemental-groups, and seLinuxOptions. (VIKE-1277) | None |
The takeover operation fails if it is initiated when the 'Disaster Recovery plan' is configured before the 'Data replication configuration' is complete on the primary cluster. (VIKE-1294) | Wait for the following error in the Disaster Recovery Plan CR status to go away before attempting takeover - metadataSyncStatus: Metadata backup transfer failed for all peer clusters. Alternatively, wait for the data replication CR status to be consistent-up-to-date before applying the Disaster Recovery Plan CR, as described in 'Configuring Data Replication'. |
On OCP 4.9, | Run oc edit scc nfd-worker on the bastion node and change users: - system:serviceaccount::nfd-worker to users: - system:serviceaccount:<Name of the namespace>:nfd-worker. Wait for some time for the nfd-worker pods to be in the 'Running' state. |
If multiple Data Replication plans for Disaster Recovery are created in parallel, the initial VVR synchronization is slow because of the locking that must be acquired in VVR for each Data Replication plan. (VIKE-1505) | To create multiple Data Replication plans, create each subsequent plan after the current plan's initial synchronization is complete and its status is consistent-up-to-date. |
When UEFI secure boot is enabled, only the signed kernel modules that are authenticated by using a key on the system key ring can be successfully loaded. Hence InfoScale kernel modules might fail to load with UEFI secure boot. (VIKE-1578) | Disable UEFI secure boot. See the relevant documentation for details. |
Under the following conditions, adding disks to a disk group in an already configured InfoScale cluster fails with the error message VxVM vxdg ERROR V-5-1-559 Disk node000_vmdk0_1: Name is already used (VIKE-2598) | None |
If thin-provisioned LUNs are configured, a mismatch in the total size and free size of the Disk Group might be observed. (VIKE-2497) | None |
While data is being replicated, if the SRL buffer on the primary fills up because of faster incoming writes on the primary volume, replication switches to DCM mode. This can be verified by running kubectl/oc get datareplication <datareplication name>. The output is similar to the following - Replication status: logging to DCM (needs dcm resynchronization) (VIKE-2600) | Run kubectl/oc edit datareplication <datareplication name> to edit the data replication custom resource and set 'replicationState: resync'. This resynchronizes the replication volume and transitions replication back to the normal state (see the sketch after this table). |
After DR takeover is completed and the old primary is up, VVR failback synchronization of the data is initiated from the new primary site to the new secondary site. This data synchronization might be slow, or the system might become unresponsive, resulting in an incomplete VVR failback. (VIKE-2634) | |
Sometimes after a cluster fault, the InfoScale nodes rejoin the cluster and the infoscale-sds pods go into a running state, but the InfoScale disk group remains in a deported state. (VIKE-2666) | Log in to any of the infoscale-sds pods (see the sketch after this table). |
After InfoScale is deployed, the cluster phase does not get updated and the state is 'Degraded', even when the cluster is healthy. (VIKE-2586) | Restart the InfoScale operator by deleting the operator pod. |
If stale volumes are present on the system, undeploying InfoScale becomes unresponsive, with the SDS pods in a terminating state. (VIKE-2635) | Reboot all worker nodes. You can reboot them sequentially if applications in the cluster might be impacted by a simultaneous reboot of all nodes. |
After a cluster reboot, RVGShared group creation might fail with the following error message - | Perform the following steps - |
When InfoScale is installed on AKS, the InfoScale sds pod does not come up after a reboot and is in an ImagePullBackOff state. (VIKE-2623) | Perform the following steps. |
On an AKS cluster, the disk goes into a Detached state after a node reboot. (VIKE-2589) | Perform the following steps. |
Sometimes a primary cluster fault event might occur after a PVC resize operation on the primary cluster, followed by a takeover operation carried out from the secondary cluster. In that case, when the old primary rejoins the cluster membership, some volumes on the old primary are in a | Run vxrecover -s from an infoscale-sds pod. The state changes to |
During upgrade, sometimes InfoScale sds pod remains in a terminating state. (VIKE-3003) | Reboot the node where the pod is stuck in a terminating state. |
InfoScale uses a third-party utility, Cert-Manager, to issue security certificates to InfoScale operators. Due to a Cert-Manager issue, security certificates are not issued, and the operator logs messages like | Review the InfoScale operator log message - |
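The workarounds for issues 4053632 and 4057713 both involve clearing the tutil0 utility field on the RVG from within an InfoScale pod. The following is a minimal sketch of that sequence; the pod, namespace, and RVG names are placeholders that you must replace with values from your deployment, and only the vxedit command itself is taken from the workaround above.
```
# Open a shell inside one of the infoscale-sds pods (use oc instead of kubectl on OpenShift).
kubectl exec -it <infoscale-sds-pod> -n <infoscale-namespace> -- bash

# Inspect the RVG record; a non-empty tutil0 field indicates the stale lock.
vxprint -l <rvg_name>

# Clear the tutil0 field as the error message instructs, then retry the DR migration.
vxedit -f set tutil0="" <rvg_name>
```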
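For issue VIKE-1108, the workaround is to disable tx and rx offload for the Calico tunnel device on every node. A minimal sketch using ethtool follows; the device name depends on your Calico configuration (for example, vxlan.calico for VXLAN or tunl0 for IPIP) and is an assumption here, not a value stated in this document.
```
# Run on each node, either directly or through a privileged debug pod.
# Replace vxlan.calico with your Calico tunnel interface name (an assumed example).
ethtool -K vxlan.calico tx off rx off

# Verify the resulting offload settings.
ethtool -k vxlan.calico | grep -E 'tx-checksumming|rx-checksumming'
```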
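For issue VIKE-2600, the workaround is to trigger a DCM resynchronization through the DataReplication custom resource. A minimal sketch using the commands from the workaround above; the resource and namespace names are placeholders, and the exact location of the replicationState field within the CR is assumed.
```
# Confirm that replication has switched to DCM logging (use oc instead of kubectl on OpenShift).
kubectl get datareplication <datareplication-name> -n <namespace>
# Expected output includes:
#   Replication status: logging to DCM (needs dcm resynchronization)

# Edit the DataReplication custom resource and set the replication state to resync:
kubectl edit datareplication <datareplication-name> -n <namespace>
#   replicationState: resync
# VVR then resynchronizes the replication volume and replication returns to the normal state.
```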
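The workaround for issue VIKE-2666 lists only its first step (logging in to an infoscale-sds pod). The commands below are a hedged sketch of how a deported shared disk group is typically listed and re-imported with standard VxVM commands; the disk group name is a placeholder, and these steps are an assumption rather than the documented procedure, so confirm them with Veritas support before use.
```
# Log in to any of the infoscale-sds pods (use oc instead of kubectl on OpenShift).
kubectl exec -it <infoscale-sds-pod> -n <infoscale-namespace> -- bash

# List all disk groups, including deported ones (shown in parentheses).
vxdisk -o alldgs list

# Import the deported disk group as a shared group, then start its volumes.
# Assumed standard VxVM commands, not steps confirmed by this document.
vxdg -s import <disk_group_name>
vxvol -g <disk_group_name> startall
```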