Veritas InfoScale™ for Kubernetes Environments 8.0.220 - Linux
- Overview
- System requirements
- Preparing to install InfoScale on Containers
- Installing Veritas InfoScale on OpenShift
- Installing InfoScale on a system with Internet connectivity
- Installing InfoScale in an air gapped system
- Installing Veritas InfoScale on Kubernetes
- Prerequisites
- Tagging the InfoScale images on Kubernetes
- Installing InfoScale on Kubernetes
- Configuring KMS-based Encryption on an OpenShift cluster
- Configuring KMS-based Encryption on a Kubernetes cluster
- InfoScale CSI deployment in Container environment
- Dynamic provisioning
- Snapshot provisioning (Creating volume snapshots)
- Managing InfoScale volume snapshots with Velero
- Volume cloning
- Installing and configuring InfoScale DR Manager on OpenShift
- Installing and configuring InfoScale DR Manager on Kubernetes
- Disaster Recovery scenarios
- Configuring InfoScale
- Administering InfoScale on Containers
- Upgrading InfoScale
- Troubleshooting
Known Issues
The following issues were observed during testing in an internal test environment. They remain unresolved, and you might encounter them. Each issue is described, and a workaround is provided if one exists.
Note:
A workaround is a temporary solution to the issue. Veritas is working toward fixing these issues.
Table: Issue description and workaround
Description | Workaround |
---|---|
On a setup with multiple Network Interface Cards, deployment of SRO might fail because the underlying NFD worker pods cannot reach the NFD master. (4045909) | Set the appropriate interfaces, or keep only the valid Network Interface Card interfaces online. |
Deleting InfoScale pods by using the oc delete pods or kubectl delete pods command might result in a configuration failure. (4045599) | To undeploy InfoScale, delete it by using the InfoScale CR procedure. |
On a VMware Virtual Machine, deployment of InfoScale on OpenShift or Kubernetes fails if Disk UUID is not enabled. (4046388) | Enable Disk UUID before deployment. See Enabling disk UUID on virtual machines. |
For space-optimized snapshots, if writes to the volume exceed the size of the cache object, a write error is observed. This causes the snapshot volume to be marked INVALID after the mirror is detached. (4043239) | Use space-optimized snapshots only where the expected rate of data change is much smaller than the actual data volume size. |
When a file system is 100% full and a PVC resize is attempted, allocating space for the metadata or the configuration files required for the file system resize might fail, causing the PVC resize to fail. (4045020) | Contact Veritas support for system recovery. |
Deployment of pods with a PVC that is restored from a snapshot or cloned from another PVC fails when it is initiated in ReadOnlyMany (ROM) access mode. The deployment does not come up. (4040975) | Set the following deployment parameters to True - and . |
Disk initialization performed by using the vxdisksetup command fails with the following error message - | Ensure that the disk does not belong to any other disk group. If it does not belong to any other disk group, the disk might have stale metadata. Run vxdiskunsetup on the disk and retry the disk initialization. |
A message 'File missing or empty: /boot/grub/menu.lst' is displayed even after a successful disk initialization. (4039351) | Ignore the message. |
When a majority of the nodes in a cluster go into a 'NotReady' state, fast failover (kube-fencing) panics those nodes and InfoScale fencing panics the remaining nodes. Even after the cluster comes back with a majority of the nodes in a 'Ready' state, unfinished kube-fencing jobs continue to panic nodes. (4044408) | Manually delete the kube-fencing jobs until the cluster is up. |
With a heavy workload, a node goes into a 'NotReady' state and InfoScale pods get killed. (4044963) | OpenShift Container Platform (OCP) runs extra system pods that consume memory. With heavy workloads, pods are killed to free memory. Try the following - |
When a back enclosure is disabled and enabled on a cvm-slave node, the disk fails to attach back to the disk group. (4046928) | Run /etc/vx/bin/vxreattach. |
After a slave node is faulted, one or more volumes do not get mounted, or existing volumes get unmounted inside the application pod. (4044533) | Reschedule or restart the pod to mount the volume. |
If a worker node is powered off or rebooted, the node goes into the emergency shell and enters the 'NotReady' state, becoming inaccessible. (4053892) | Reinstall or reconfigure the control plane on the worker node. See the OpenShift documentation. |
After creating a PVC in RWX mode, data written by an application pod running on one node is not accessible from an application pod scheduled on another node. (4046460) | See https://access.redhat.com/solutions/6153272 for the recommended solution. |
In the container form factor, if the public or private NICs used for LLT are bonded and the underlying bonded NICs are configured on the same switch, the worker nodes on which InfoScale is configured might panic randomly with the message kernel BUG at mm/slub.c:305!. (4048786) | If NIC bonding is required for the LLT links, configure the underlying NICs on different switches to avoid the kernel panic, even though the crash has no functional impact. If the private links are connected to the same switch, the bond mode must be Active-Backup. |
Disks from a node of the InfoScale cluster do not get added to the disk group - | Add these disks to the disk group manually from the |
During InfoScale deployment, the InfoScale configuration is not complete on one of the nodes and the node remains in a 'NotReady' state. (4047598) | Remove |
If InfoScale is undeployed on all nodes while retaining the disk groups, re-creating the InfoScale cluster on some nodes and adding nodes to the cluster fails. (4047205) | Undeploy and re-create the InfoScale cluster on the same number of nodes. To undeploy InfoScale on all nodes and re-create InfoScale clusters on some nodes, contact Veritas support. |
After restoring a space-optimized snapshot to a new PVC, mounting the restored PVC may fail if the source snapshot volume is detached. (4012858) | Try one of the following - |
The CSI controller pod remains in the 'Terminating' state after a graceful node shutdown or power-off. (4011482) | Try one of the following - |
CSI node pods do not get rescheduled on other worker nodes when their parent node is drained. (4011384) | None |
While restoring a snapshot, the PVC goes into a 'Pending' state after rebooting all nodes except the master node in the cluster. (4014525, 4048825) | Delete the snapshot volume by using the vxedit command. Kubernetes automatically reattempts to create the volume snapshot. |
In case of a storage failure, application I/Os to the mount point inside the container fail and the pod goes into the CreateContainerConfigError or Error state. (4011219, 4014758, 4015259) | Manually restart the application pod after the storage failure is resolved. |
The current application on the primary must not be deleted until it is clear that DR is possible. In some cases, DR fails and the application gets deleted on the primary. (4047475) | Ensure that the peer clusters are connected and Data Replication is in a healthy state. |
If all worker nodes go down at the same time, the InfoScale availability configuration is lost. (4050355) | After recovery, the InfoScale configuration is re-created. This might take up to 20 minutes. |
If applications on a cluster with a Load Balancer configuration are migrated, the Load Balancer service appears in a 'Pending' state if the target cluster's Load Balancer IP addresses are different. (4051429) | If you are using a Load Balancer service, use DNS custom resources to manage DNS endpoint mapping. |
The delete datarep operation becomes unresponsive and a force delete in the CR fails. (4050857) | Complete the following steps to clean up. See the Veritas Volume Replicator and Veritas Cluster Server documentation for details. |
DR migrate fails to complete, and running the command kubectl describe datareplication.infoscale.veritas.com/<datarep_name> returns the message vradmin migrate command failed. (4053632) | Complete the following steps so that tutil is cleared and the DR migration completes (see the sketch after this table). |
Kube-fencing is not functional when the REST service and the InfoScale operator pod are not running. (4054545) | None |
In a VVR environment, vradmin migrate might fail with the following error message: VxVM VVR vxrvg ERROR V-5-1-1617 giving up: utility fields must be cleared by executing: vxedit -f set tutil0="" <rvg> (4057713) | Run the command vxedit -f set tutil0="" <rvg> to clear the tutil0 field (see the sketch after this table). |
During cluster configuration or while adding new nodes on OpenShift or Kubernetes, a node join might fail if the disks in the cluster have old disk group records from previous deployments. Output similar to the following indicates old disk group records: vxvm:vxconfigd[238047]: V-5-1-11092 cleanup_client: (Disk in use by another cluster) 223 esxd05vm06 kernel: VxVM vxio V-5-0-164 Failed to join cluster infoscale_22670, aborting : (4057178) | Reset the cluster ID on the disks of the joiner node to allow the node to join. |
Stale InfoScale kernel modules might be left over after undeploying the InfoScale cluster on OpenShift or Kubernetes. (4042642) | Before deploying InfoScale on OpenShift or Kubernetes, check whether any stale InfoScale kernel modules ( |
LLT tools are unable to detect a duplicate InfoScale cluster ID on OpenShift. (4057800) | Redeploy the InfoScale CR to avoid a duplicate cluster ID. |
An RLINK detach is observed while replication autosync is in progress between the primary and secondary clusters. (VIKE-1290) | Run kubectl describe datarep <datareplicationName> to check the datareplication status. If |
Volume resize fails after migration if the storage class consumed by the application pod's PVC is not present on the secondary cluster, which is the new primary. (VIKE-1385) | Manually create a Storage Class (SC) on the new primary cluster that is the same as the one on the old primary cluster. |
On a Kubernetes cluster, TCP traffic reaches the node but does not get forwarded to the pod. (VIKE-1108) | Disable tx and rx offload on all nodes for the Calico tunnel device (see the sketch after this table). |
Migration or takeover fails if the namespace being backed up already exists on the target cluster with different SCC (Security Context Constraints)-related annotations, such as uid-range, supplemental-groups, and seLinuxOptions. (VIKE-1277) | None |
The takeover operation fails if it is initiated when the 'Disaster Recovery plan' is configured before the 'Data replication configuration' is complete on the primary cluster. (VIKE-1294) | Wait for the following error in the Disaster Recovery Plan CR status to go away before attempting takeover - metadataSyncStatus: Metadata backup transfer failed for all peer clusters. Alternatively, wait for the data replication CR status to be consistent-up-to-date before applying the Disaster Recovery Plan CR, as described in 'Configuring Data Replication'. |
On OCP 4.9, | Run oc edit scc nfd-worker on the bastion node and change users: - system:serviceaccount::nfd-worker to users: - system:serviceaccount:<Name of the namespace>:nfd-worker. Wait for some time for the nfd-worker pods to be in the 'Running' state. |
If multiple Data Replication plans for Disaster Recovery are created in parallel, the initial VVR synchronization is slow because of the locking that must be acquired in VVR for each Data Replication plan. (VIKE-1505) | To create multiple Data Replication plans, create each subsequent plan after the current plan's initial synchronization is complete and its status is consistent-up-to-date. |
When UEFI secure boot is enabled, only the signed kernel modules that are authenticated by using a key on the system key ring can be successfully loaded. Hence InfoScale kernel modules might fail to load with UEFI secure boot. (VIKE-1578) | Disable UEFI secure boot. See the relevant documentation for details. |
Under the following conditions, adding disks to a disk group in an already configured InfoScale cluster fails with the error message VxVM vxdg ERROR V-5-1-559 Disk node000_vmdk0_1: Name is already used (VIKE-2598) | None |
If thin-provisioned LUNs are configured, a mismatch in the total size and free size of the Disk Group might be observed. (VIKE-2497) | None |
While data is being replicated, if the SRL buffer on the primary fills up because of faster incoming writes on the primary volume, replication switches to DCM mode. This can be verified by running kubectl/oc get datareplication <datareplication name>. The output is similar to the following - Replication status: logging to DCM (needs dcm resynchronization) (VIKE-2600) | Run kubectl/oc edit datareplication <datareplication name> to edit the data replication custom resource and set 'replicationState: resync'. This resynchronizes the replication volume and transitions replication back to the normal state (see the sketch after this table). |
After DR takeover is completed and the old primary is up, VVR failback synchronization of the data is initiated from the new primary site to the new secondary site. This data synchronization might be slow, or the system might become unresponsive, resulting in an incomplete VVR failback. (VIKE-2634) | |
Sometimes after a cluster fault, the InfoScale nodes rejoin the cluster and the infoscale-sds pods go into a running state, but the InfoScale disk group remains in a deported state. (VIKE-2666) | Log in to any of the infoscale-sds pods (see the sketch after this table). |
After InfoScale is deployed, the cluster phase does not get updated and the state is 'Degraded', even when the cluster is healthy. (VIKE-2586) | Restart the InfoScale operator by deleting the operator pod. |
If stale volumes are present on the system, undeploying InfoScale becomes unresponsive, with the SDS pods in a terminating state. (VIKE-2635) | Reboot all worker nodes. You can reboot them sequentially if applications in the cluster might be impacted by a simultaneous reboot of all nodes. |
After a cluster reboot, RVGShared group creation might fail with the following error message - | Perform the following steps - |
When InfoScale is installed on AKS, the InfoScale sds pod does not come up after a reboot and is in an ImagePullBackOff state. (VIKE-2623) | Perform the following steps. |
On an AKS cluster, the disk goes into a Detached state after a node reboot. (VIKE-2589) | Perform the following steps. |
Sometimes a primary cluster fault event might occur after a PVC resize operation on the primary cluster, followed by a takeover operation carried out from the secondary cluster. In that case, when the old primary rejoins the cluster membership, some volumes on the old primary are in a | Run vxrecover -s from an infoscale-sds pod. The state changes to |
During upgrade, sometimes InfoScale sds pod remains in a terminating state. (VIKE-3003) | Reboot the node where the pod is stuck in a terminating state. |
InfoScale uses a third-party utility, Cert-Manager, to issue security certificates to InfoScale operators. Due to a Cert-Manager issue, security certificates are not issued, and the operator logs messages like | Review the InfoScale operator log message - |
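The workarounds for issues 4053632 and 4057713 both involve clearing the tutil0 utility field on the RVG from within an InfoScale pod. The following is a minimal sketch of that sequence; the pod, namespace, and RVG names are placeholders that you must replace with values from your deployment, and only the vxedit command itself is taken from the workaround above.
```
# Open a shell inside one of the infoscale-sds pods (use oc instead of kubectl on OpenShift).
kubectl exec -it <infoscale-sds-pod> -n <infoscale-namespace> -- bash

# Inspect the RVG record; a non-empty tutil0 field indicates the stale lock.
vxprint -l <rvg_name>

# Clear the tutil0 field as the error message instructs, then retry the DR migration.
vxedit -f set tutil0="" <rvg_name>
```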
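For issue VIKE-1108, the workaround is to disable tx and rx offload for the Calico tunnel device on every node. A minimal sketch using ethtool follows; the device name depends on your Calico configuration (for example, vxlan.calico for VXLAN or tunl0 for IPIP) and is an assumption here, not a value stated in this document.
```
# Run on each node, either directly or through a privileged debug pod.
# Replace vxlan.calico with your Calico tunnel interface name (an assumed example).
ethtool -K vxlan.calico tx off rx off

# Verify the resulting offload settings.
ethtool -k vxlan.calico | grep -E 'tx-checksumming|rx-checksumming'
```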
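For issue VIKE-2600, the workaround is to trigger a DCM resynchronization through the DataReplication custom resource. A minimal sketch using the commands from the workaround above; the resource and namespace names are placeholders, and the exact location of the replicationState field within the CR is assumed.
```
# Confirm that replication has switched to DCM logging (use oc instead of kubectl on OpenShift).
kubectl get datareplication <datareplication-name> -n <namespace>
# Expected output includes:
#   Replication status: logging to DCM (needs dcm resynchronization)

# Edit the DataReplication custom resource and set the replication state to resync:
kubectl edit datareplication <datareplication-name> -n <namespace>
#   replicationState: resync
# VVR then resynchronizes the replication volume and replication returns to the normal state.
```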
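The workaround for issue VIKE-2666 lists only its first step (logging in to an infoscale-sds pod). The commands below are a hedged sketch of how a deported shared disk group is typically listed and re-imported with standard VxVM commands; the disk group name is a placeholder, and these steps are an assumption rather than the documented procedure, so confirm them with Veritas support before use.
```
# Log in to any of the infoscale-sds pods (use oc instead of kubectl on OpenShift).
kubectl exec -it <infoscale-sds-pod> -n <infoscale-namespace> -- bash

# List all disk groups, including deported ones (shown in parentheses).
vxdisk -o alldgs list

# Import the deported disk group as a shared group, then start its volumes.
# Assumed standard VxVM commands, not steps confirmed by this document.
vxdg -s import <disk_group_name>
vxvol -g <disk_group_name> startall
```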