Veritas InfoScale™ for Kubernetes Environments 8.0.200 - Linux

Last Published:
Product(s): InfoScale & Storage Foundation (8.0.200)
Platform: Linux
  1. Overview
    1. Introduction
    2. Features of InfoScale in Containerized environment
    3. CSI Introduction
    4. I/O fencing
    5. Disaster Recovery
    6. Licensing
    7. Encryption
  2. System requirements
    1. Introduction
    2. Supported platforms
    3. Disk space requirements
    4. Hardware requirements
    5. Number of nodes supported
    6. DR support
  3. Preparing to install InfoScale on Containers
    1. Setting up the private network
      1. Guidelines for setting the media speed for LLT interconnects
      2. Guidelines for setting the maximum transmission unit (MTU) for LLT
    2. Synchronizing time settings on cluster nodes
    3. Securing your InfoScale deployment
    4. Configuring kdump
  4. Installing Veritas InfoScale on OpenShift
    1. Introduction
    2. Prerequisites
    3. Additional Prerequisites for Azure RedHat OpenShift (ARO)
    4. Considerations for configuring cluster or adding nodes to an existing cluster
    5. Installing InfoScale on a system with Internet connectivity
      1. Installing from OperatorHub by using web console
        1. Adding Nodes to an InfoScale cluster by using OLM
        2. Undeploying and uninstalling InfoScale
      2. Installing from OperatorHub by using Command Line Interface (CLI)
        1. Configuring cluster
        2. Adding nodes to an existing cluster
        3. Undeploying and uninstalling InfoScale by using CLI
      3. Installing by using YAML
        1. Configuring cluster
        2. Adding nodes to an existing cluster
        3. Undeploying and uninstalling InfoScale
    6. Installing InfoScale in an air gapped system
      1. Prerequisites to install by using YAML or OLM
      2. Additional prerequisites to install by using YAML
      3. Installing from OperatorHub by using web console
      4. Installing from OperatorHub by using Command Line Interface (CLI)
      5. Installing by using YAML
  5. Installing Veritas InfoScale on Kubernetes
    1. Introduction
    2. Prerequisites
      1. Installing Node Feature Discovery (NFD) Operator and Cert-Manager on Kubernetes
    3. Installing the Special Resource Operator
    4. Tagging the InfoScale images on Kubernetes
      1. Downloading side car images
    5. Applying licenses
    6. Tech Preview: Installing InfoScale on an Azure Kubernetes Service (AKS) cluster
    7. Considerations for configuring cluster or adding nodes to an existing cluster
    8. Installing InfoScale on Kubernetes
      1. Configuring cluster
      2. Adding nodes to an existing cluster
    9. Installing InfoScale by using the plugin
    10. Undeploying and uninstalling InfoScale
  6. Configuring KMS-based Encryption on an OpenShift cluster
    1. Introduction
    2. Adding a custom CA certificate
    3. Configuring InfoScale to enable transfer of keys
    4. Enabling rekey for an encrypted Volume
  7. Configuring KMS-based Encryption on a Kubernetes cluster
    1. Introduction
    2. Adding a custom CA certificate
    3. Configuring InfoScale to enable transfer of keys
    4. Enabling rekey for an encrypted Volume
  8. InfoScale CSI deployment in Container environment
    1. CSI plugin deployment
    2. Raw block volume support
    3. Static provisioning
    4. Dynamic provisioning
      1. Reclaiming provisioned storage
    5. Resizing Persistent Volumes (CSI volume expansion)
    6. Snapshot provisioning (Creating volume snapshots)
      1. Dynamic provisioning of a snapshot
      2. Static provisioning of an existing snapshot
      3. Using a snapshot
      4. Restoring a snapshot to new PVC
      5. Deleting a volume snapshot
      6. Creating snapshot of a raw block volume
    7. Managing InfoScale volume snapshots with Velero
      1. Setting up Velero with InfoScale CSI
      2. Taking the Velero backup
      3. Creating a schedule for a backup
      4. Restoring from the Velero backup
    8. Volume cloning
      1. Creating volume clones
      2. Deleting a volume clone
    9. Using InfoScale with non-root containers
    10. Using InfoScale in SELinux environments
    11. CSI Drivers
    12. Creating CSI Objects for OpenShift
  9. Installing and configuring InfoScale DR Manager on OpenShift
    1. Introduction
    2. Prerequisites
    3. Creating Persistent Volume for metadata backup
    4. External dependencies
    5. Installing InfoScale DR Manager by using OLM
      1. Installing InfoScale DR Manager by using web console
      2. Configuring InfoScale DR Manager by using web console
      3. Installing from OperatorHub by using Command Line Interface (CLI)
    6. Installing InfoScale DR Manager by using YAML
      1. Configuring Global Cluster Membership (GCM)
      2. Configuring Data Replication
      3. Additional requirements for replication on Cloud
      4. Configuring DNS
      5. Configuring Disaster Recovery Plan
  10. Installing and configuring InfoScale DR Manager on Kubernetes
    1. Introduction
    2. Prerequisites
    3. Creating Persistent Volume for metadata backup
    4. External dependencies
    5. Installing InfoScale DR Manager
      1. Configuring Global Cluster Membership (GCM)
      2. Configuring Data Replication
      3. Additional requirements for replication on Cloud
      4. Configuring DNS
      5. Configuring Disaster Recovery Plan
  11. Disaster Recovery scenarios
    1. Migration
    2. Takeover
  12. Configuring InfoScale
    1. Logging mechanism
    2. Configuring Veritas Oracle Data Manager (VRTSodm)
    3. Enabling user access and other pod-related logs in Container environment
  13. Administering InfoScale on Containers
    1. Adding Storage to an InfoScale cluster
    2. Managing licenses
  14. Upgrading InfoScale
    1. Prerequisites
    2. On a Kubernetes cluster
    3. On an OpenShift cluster
  15. Troubleshooting
    1. Collecting logs by using SORT Data Collector
    2. Known Issues
    3. Limitations

Known Issues

The following issues were observed during testing in an internal test environment. These issues remain unresolved, and you might encounter them. Each issue is described along with a workaround, if one exists.

Note:

A workaround is a temporary solution to the issue. Veritas is working towards fixing the issue.

Table: Issue description and workaround

Each issue description below is followed by its workaround.

On a setup with multiple Network Interface Cards, deployment of SRO might fail because the underlying NFD worker pods cannot reach the NFD master. (4045909)

If Calico CNI is used, set the appropriate IP_AUTODETECTION_METHOD to be used for Kubernetes communication (see the sketch after these options).

or

Keep only the valid Network Interface Cards online.
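For the first option, a hedged sketch follows; the namespace (kube-system) and the interface pattern are assumptions and must be adjusted to your Calico installation:

# Assumed namespace and interface pattern; verify against your Calico deployment.
kubectl set env daemonset/calico-node -n kube-system \
  IP_AUTODETECTION_METHOD=interface=ens.*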

Deleting InfoScale pods by using the oc delete pods or kubectl delete pods command might result in a configuration failure. (4045599)

To undeploy InfoScale, delete it by following the InfoScale CR deletion procedure.

On a VMware Virtual Machine, deployment of InfoScale on OpenShift or Kubernetes fails if Disk UUID is not enabled. (4046388)

Enable Disk UUID before deployment. See Enabling disk UUID on virtual machines.
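For example, one way to enable the attribute with govc while the virtual machine is powered off (the VM name is a placeholder; this is a sketch, not the only supported method):

# <vm_name> is a placeholder; power off the VM before changing the setting.
govc vm.change -vm <vm_name> -e disk.enableUUID=TRUE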

For space-optimized snapshots, if the writes on the volume exceed the size of the cache object, a write error is observed. This leads to the snapshot volume being marked as INVALID after the mirror is detached. (4043239)

Use space-optimized snapshots only in cases where expected rate of data change is much smaller than actual data volume size.

When a file system is 100% full and a PVC resize is attempted, allocating space for the metadata or the configuration files required for the file system resize might fail, causing the PVC resize to fail. (4045020)

Contact Veritas support for system recovery.

Deployment of pods with a PVC that is restored from a snapshot or cloned from another PVC fails when it is initiated in the ReadOnlyMany (ROX) access mode. The deployment goes into the CreateContainerError state. (4040975)

Set the following deployment parameters to true: Pod.spec.volumes.persistentVolumeClaim.readOnly and Pod.spec.containers.volumeMounts[x].readOnly.
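A minimal pod spec fragment illustrating the two readOnly settings; the pod, container, image, and claim names are placeholders:

apiVersion: v1
kind: Pod
metadata:
  name: restored-app                  # placeholder name
spec:
  containers:
  - name: app                         # placeholder container
    image: <application image>
    volumeMounts:
    - name: data
      mountPath: /data
      readOnly: true                  # Pod.spec.containers.volumeMounts[x].readOnly
  volumes:
  - name: data
    persistentVolumeClaim:
      claimName: <restored or cloned PVC>
      readOnly: true                  # Pod.spec.volumes.persistentVolumeClaim.readOnly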

Disk initialization performed by using the vxdisksetup command fails with the following error message: VxVM vxdisksetup ERROR V-5-2-1120 node002_vmdk0_0: Disk is tagged as imported to a shared disk group. Can not proceed. (4045033)

Ensure that the disk does not belong to any other disk group. If it does not, the disk might have some stale metadata. Run vxdiskunsetup on the disk and try the disk initialization again.
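A minimal sketch of the cleanup and re-initialization, using the disk name from the error message above as an example:

/etc/vx/bin/vxdiskunsetup node002_vmdk0_0    # clear stale VxVM metadata from the disk
/etc/vx/bin/vxdisksetup -i node002_vmdk0_0   # initialize the disk for VxVM use again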

A message 'File missing or empty: /boot/grub/menu.lst' is displayed even after a successful disk initialization. (4039351)

Ignore the message.

When a majority of the nodes in a cluster go into a 'NotReady' state, fast failover (kube-fencing) panics those nodes and InfoScale fencing panics the rest of the nodes. Even after the cluster is back with a majority of the nodes in a 'Ready' state, unfinished kube-fencing jobs continue to panic nodes. (4044408)

Manually delete the kube-fencing jobs until the cluster is up.
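A hedged sketch for finding and removing the leftover jobs; the name filter, namespace, and job name are assumptions:

kubectl get jobs -A | grep -i fenc             # locate kube-fencing jobs (name filter is an assumption)
kubectl delete job -n <namespace> <job_name>   # delete each leftover fencing job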

With a heavy workload, a node goes into a 'NotReady' state and InfoScale pods get killed. (4044963)

OpenShift Container Platform (OCP) runs extra system pods which consume memory. With heavy workloads, pods are killed to clear memory. Try the following -

  • Place a resource cap on less important OCP system pods like Prometheus (OCP Monitoring service). See OpenShift documentation.

  • Set pod eviction thresholds and set Kube-reserved and System-reserved resources. Pods are evicted when the resources available on the node fall below the specified limits (a sketch follows this list). See OpenShift documentation.

  • Provision higher physical memory for the node.
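A minimal KubeletConfig sketch for OpenShift worker nodes; the resource name, label selector, and all reservation and eviction values are assumptions to be tuned for your environment:

apiVersion: machineconfiguration.openshift.io/v1
kind: KubeletConfig
metadata:
  name: worker-resource-reservation                                   # placeholder name
spec:
  machineConfigPoolSelector:
    matchLabels:
      pools.operator.machineconfiguration.openshift.io/worker: ""     # assumed worker MCP label
  kubeletConfig:
    systemReserved:                    # reserve CPU/memory for the operating system
      cpu: 500m
      memory: 1Gi
    kubeReserved:                      # reserve CPU/memory for Kubernetes daemons
      cpu: 500m
      memory: 1Gi
    evictionHard:                      # evict pods before the node runs out of memory
      memory.available: "500Mi"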

When a back-end enclosure is disabled and then enabled on a CVM slave node, the disk fails to attach back to the disk group. (4046928)

Run /etc/vx/bin/vxreattach.

After faulting a slave node, one or more volumes do not get mounted or existing volumes get unmounted inside application pod. (4044533)

Reschedule/restart the pod to mount the volume.

If a worker node is powered off or rebooted, the node goes into the emergency shell and enters the 'NotReady' state, thus becoming inaccessible. (4053892)

Reinstall or reconfigure the control plane on the worker node. See OpenShift documentation.

After creating a PVC in RWX mode, data written from an application pod running on one node is not accessible from an application pod scheduled on another node. (4046460)

See https://access.redhat.com/solutions/6153272 for the recommended solution.

In the container form factor, if the public or private NICs to be used for LLT are bonded and the underlying bonded NICs are configured on the same switch, the worker nodes on which InfoScale is configured might panic randomly with the message kernel BUG at mm/slub.c:305!. (4048786)

If NIC bonding is required for the LLT links, ensure that the underlying NICs are configured on different switches to avoid the kernel panic on the node, even though the crash has no functional impact. If private links are connected to the same switch, the bond mode must be Active-Backup.

Disks from a node of the InfoScale cluster do not get added to the disk group vrts_kube_dg. The following error message is logged in syslog on the master node: V-5-1-18986 sal_map_devices: da_online failed with error 142 for SAL disk <disk_name>. On running kubectl describe infoscalecluster -n infoscale-vtas, the output indicates a disk addition failure. (4055278)

Add these disks to the disk group manually from the infoscale-vtas-driver-container pod.
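A minimal sketch of the manual addition, run from the infoscale-vtas-driver-container pod; the disk name is a placeholder:

vxdisk list                                # identify the disks that were not added
vxdg -g vrts_kube_dg adddisk <disk_name>   # add each missing disk to the disk group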

During InfoScale deployment, InfoScale configuration does not complete on one of the nodes and the node remains in a 'Not ready' state. (4047598)

Delete the InfoScale cluster CR that was applied from cr.yaml and deploy InfoScale again by using cr.yaml.
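For example, assuming the cluster was configured by applying cr.yaml (use oc instead of kubectl on OpenShift):

kubectl delete -f cr.yaml    # remove the existing InfoScale cluster CR
kubectl apply -f cr.yaml     # deploy InfoScale again from the same definition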

If InfoScale is undeployed on all nodes while retaining the disk groups, re-creating the InfoScale cluster on a subset of nodes and adding nodes to that cluster fails. (4047205)

Undeploy and re-create the InfoScale cluster on an identical number of nodes. To undeploy InfoScale on all nodes and re-create the InfoScale cluster on a subset of nodes, contact Veritas support.

After restoring a space-optimized snapshot to a new PVC, the mount on the restored PVC might fail if the source snapshot volume is detached. (4012858)

Try one of the following -

  • Use CSI space-optimized snapshot functionality for read-intensive applications.

  • Use full-instant snapshot or CSI clone functionality for write-intensive or read-write-update applications.

  • Manually set configurable parameters such as cachesize to appropriate values based on the application workload while creating the CSI VolumeSnapshotClass object.

CSI controller pod remains in 'Terminating' state in case of graceful node shutdown or power-off (4011482)

Try one of the following -

  • If a node must be kept shut down for a certain period, to ensure availability, use the following command to drain the node before shutting it down: kubectl drain <node_name> --force --ignore-daemonsets --delete-local-data

  • If you intend to delete the node from the Kubernetes cluster, delete the node object. In that case, you need not drain the node manually.

CSI node pods do not get rescheduled on other worker nodes when their parent node is drained. (4011384)

None

While restoring a snapshot, the PVC goes into a 'Pending' state after rebooting all nodes except the master node in the cluster. (4014525, 4048825)

Delete the snapshot volume by using the vxedit command. Kubernetes automatically reattempts to create the volume snapshot.
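A hedged sketch of the deletion, assuming the disk group name vrts_kube_dg used elsewhere in this guide and a placeholder volume name; run it from an InfoScale driver container:

vxedit -g vrts_kube_dg -rf rm <snapshot_volume_name>   # remove the stale snapshot volume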

In case of a storage failure, application I/Os to the mount point inside the container fail and the pod goes into the CreateContainerConfigError or Error state. (4011219, 4014758, 4015259)

Manually restart the application pod after the storage failure is resolved.

The current application on the primary must not be deleted until it is clear that DR is possible. In some cases, DR fails and the application gets deleted on the primary. (4047475)

Ensure that the peer clusters are connected and Data Replication is in a healthy state.

If all worker nodes go down at the same time, the InfoScale availability configuration is lost. (4050355)

After recovery, InfoScale configuration is re-created. It might take up to 20 minutes.

If applications on a cluster with Load Balancer configuration are migrated, Load balancer service appears in 'Pending' state if the target cluster's Load balancer IP addresses are different. (4051429)

If you are using Load Balancer service, use DNS custom resources to manage DNS endpoint mapping.

The delete datarep operation becomes unresponsive, and a force delete in the CR fails. (4050857)

Complete the following steps to clean up:

  1. Check the DR controller logs to determine which cleanup step is failing.

  2. Delete the DataReplication Custom Resource (CR) on all clusters by using the kubectl or oc edit datarep <name> command and removing the finalizer string infoscale.veritas.com.datareplication/finalizer (a sketch follows these steps).

  3. Log in to the InfoScale cluster pod infoscale-vtas-driver-container-* on all clusters.

  4. Complete the following steps for Veritas Volume Replicator (VVR) objects cleanup

    • Stop replication for relevant RVG

    • Delete secondary

    • Delete primary

    • Delete corresponding SRL volume

  5. Complete the following steps for Veritas Cluster Server (VCS) objects cleanup

    • Change cluster operation to RW

    • Offline VIPgroup and RVGShared service groups corresponding to Datareplication CR (service group names are shown in CR status)

    • Delete resources available in these service groups

    • Delete service groups' dependencies if any

    • Delete VIPgroup and RVGShared service groups

    • Change cluster operation to RO

See Veritas Volume Replicator and Veritas Cluster Server documentation for details.
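A hedged sketch of step 2 (removing the finalizer); kubectl and oc are interchangeable here, and <name> is a placeholder:

kubectl edit datarep <name>
# In the editor, remove the following entry from metadata.finalizers and save:
#   - infoscale.veritas.com.datareplication/finalizer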

In case a DR migration fails to complete, running the command kubectl describe datareplication.infoscale.veritas.com/<datarep_name> returns the message vradmin migrate command failed. (4053632)

Complete the following steps:

  1. Run the following command to find the RVG name: kubectl get datareplications.infoscale.veritas.com <application Data Replication name>

  2. Log in to one of the InfoScale driver containers.

  3. Run the following command: vxprint -g vrts_kube_dg <RVG name>

  4. Review the output. If tutil is set to 'CONVERTING', run the following command: vxedit -g vrts_kube_dg -f set tutil0="" <RVG name>

tutil is cleared and the DR migration completes.

Kube-fencing is not functional when the REST service and the InfoScale operator pod are not running. (4054545)

None

In a VVR environment, vradmin migrate might fail with the following error message:

 VxVM VVR vxrvg ERROR V-5-1-1617 giving up:
 utility fields must be cleared by
 executing:
vxedit -f set tutil0="" <rvg>

(4057713)

Run the command vxedit -f set tutil0="" <rvg> to clear tutil on the RVG, and then retry the vradmin migrate operation.

During cluster configuration or while adding new nodes on OpenShift or Kubernetes, node join might fail if the disks in the cluster have old disk group records from previous deployments. Output similar to the following indicates old disk group records.

vxvm:vxconfigd[238047]: V-5-1-11092 
   cleanup_client: (Disk in use by another 
      cluster) 223
esxd05vm06 kernel: VxVM vxio V-5-0-164 
     Failed to join cluster 
     infoscale_22670, aborting :

(4057178)

Reset the cluster ID on the disks of the joiner node to allow the node to join.

  1. Check cluster ID on existing/already-joined nodes in cluster: Run

    vxdisk list node000_vmdk0_9 | grep -i cluster

    Cluster ID is returned in the output.

  2. If the node join fails, verify whether the cluster ID is different on the joiner node by using the same command: Run

    vxdisk list node001_vmdk0_5 | grep -i cluster

    A different Cluster ID is returned in the output.

  3. Change the cluster ID to match it with existing nodes in cluster.

    /etc/vx/diag.d/vxprivutil set /dev/vx/dmp/node001_vmdk0_5 hostid=<Cluster ID returned in the first command>

    Ensure that the disks belong to the same disk group. Consult Veritas Technical Support.

Stale InfoScale kernel modules might be left over after undeploying the InfoScale cluster on OpenShift or Kubernetes. (4042642)

Before deploying InfoScale on OpenShift or Kubernetes, check if any stale InfoScale kernel modules (vxio/vxdmp/veki/vxspec/vxfs/odm/glm/gms) are loaded. If stale modules from old deployments are still loaded, reboot all worker nodes and then proceed with the InfoScale deployment.
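For example, the check for stale modules can be run on each worker node as follows:

lsmod | grep -Ew 'vxio|vxdmp|veki|vxspec|vxfs|odm|glm|gms'   # no output means no stale InfoScale modules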

LLT tools are unable to detect a duplicate InfoScale cluster ID on OpenShift. (4057800)

Redeploy the InfoScale CR to avoid a duplicate cluster ID.

RLINK detach is observed while replication autosync is in progress between the primary and secondary clusters. (VIKE-1290)

Run kubectl describe datarep <datareplication name> to check the data replication status. If ReplicationStatus is 'stopped', edit the data replication CR and set the attributes force: true and replicationState: stop in remoteClusterDetails. Then edit the data replication CR again and set force: false and replicationState: start in remoteClusterDetails.

Volume resize fails after migration if the storage class consumed by the application pod's PVC is not present on the secondary cluster, which is the new primary. (VIKE-1385)

Manually create the Storage Class (SC) on the new primary cluster; it must be the same as on the old primary cluster.

On a Kubernetes cluster, TCP traffic reaches the node but does not get forwarded to the pod. (VIKE-1108)

Disable tx and rx offload on all nodes for calico tunnel device.
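A hedged example, assuming a Calico IP-in-IP setup where the tunnel device is named tunl0; adjust the device name (for example, vxlan.calico) to your configuration:

ethtool -K tunl0 tx off rx off   # run on every node; tunl0 is an assumed device name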

Migration or takeover fails if the namespace being backed up already exists on the target cluster with different Security Context Constraints (SCC)-related annotations, such as uid-range, supplemental-groups, and seLinuxOptions. (VIKE-1277)

None

The takeover operation fails if it is initiated when the 'Disaster Recovery plan' is configured before the 'Data replication' configuration is complete on the primary cluster. (VIKE-1294)

Wait for the following error in Disaster Recovery Plan CR status to go away before attempting takeover - metadataSyncStatus: Metadata backup transfer failed for all peer clusters

or

Wait for data replication CR status to be consistent up-to-date before applying Disaster Recovery plan CR as mentioned in 'Configuring Data Replication'.

On OCP 4.9, nfd-worker pods are not getting created. (VIKE-909)

Run oc edit scc nfd-worker on the bastion node, change

 
users:
- system:serviceaccount::nfd-worker

to

 
users:
- system:serviceaccount:
  <Name of the namespace>:nfd-worker

Wait for some time for the nfd-worker pods to be in the 'Running' State.

If multiple Data Replication plans for Disaster Recovery are created in parallel, the initial VVR synchronization is slow due to the locking that needs to be acquired in VVR for each of the Data Replication plan. (VIKE-1505)

To create multiple Data Replication plans, create the subsequent plans after the current Data Replication plan's initial synchronization is complete and status is consistent-up-to-date.

When UEFI secure boot is enabled, only the signed kernel modules that are authenticated by using a key on the system key ring can be successfully loaded. Hence InfoScale kernel modules might fail to load with UEFI secure boot. (VIKE-1578)

Disable UEFI secure boot. See the relevant documentation for details.

In the following conditions, addition of disks to a disk group in an already configured InfoScale cluster fails with the error message:

VxVM vxdg ERROR V-5-1-559 
Disk node000_vmdk0_1: Name is already used

  1. An existing disk group is imported during cluster creation.

  2. Additional storage is presented to the cluster.

  3. The sequence of the node names in the CR is changed.

(VIKE-2598)

None

If thin-provisioned LUNs are configured, a mismatch in the total size and free size of the Disk Group might be observed. (VIKE-2497)

None

While data is being replicated, if the SRL buffer on the primary is full due to faster incoming writes on the primary volume, replication switches to DCM mode. This can be verified by running kubectl/oc get datareplication <datareplication name>. The output is similar to the following:

Replication status: logging to
 DCM (needs dcm resynchronization)

(VIKE-2600)

Run kubectl/oc edit datareplication <datareplication name> to edit the data replication custom resource and set replicationState: resync. This resynchronizes the replication volume and transitions it to the normal state.

After DR takeover is completed and the old primary is up, VVR failback synchronization of the data is initiated from the new primary site to the new secondary site. This data synchronization might be slow, or the system might be unresponsive, resulting in an incomplete VVR failback. (VIKE-2634)

  1. Pause all data replications on the primary site by setting the replicationState to pause in the data replication custom resources. Then perform a sequential VVR failback synchronization by setting the replicationState to resume on only one data replication custom resource at a time, that is, the active data replication.

  2. After the VVR failback synchronization is complete on the only active data replication, resume VVR failback synchronization on the next paused data replication. Complete VVR failback synchronization on all the configured data replications.

Sometimes after a cluster fault, the InfoScale nodes rejoin the cluster and the infoscale-sds pods go into the 'Running' state, but the InfoScale disk group remains in a deported state. (VIKE-2666)

Login to any of the InfoScale-sds pods.

  • Run /opt/VRTSvcs/bin/hagrp -clear DISK_GROUP to clear the fault on DISK_GROUP and /opt/VRTSvcs/bin/hagrp -online DISK_GROUP -any to get DISK_GROUP online.

  • If the service group fails to be online, run vxdg -s import <diskgroup_name> to manually import the disk group.

  • If the disk group import fails with an error message quorum lost, reboot all InfoScale nodes in the cluster.

After InfoScale is deployed, the cluster phase does not get updated and the state shows 'Degraded', even when the cluster is healthy. (VIKE-2586)

Restart InfoScale operator by deleting the pod.

If stale volumes are present on the system, undeploying InfoScale becomes unresponsive, with the SDS pods in a 'Terminating' state. (VIKE-2635)

Reboot all worker nodes. You can reboot them sequentially if a simultaneous reboot of all nodes might impact applications in the cluster.

After a cluster reboot, RVGShared group creation might fail with the error message - Primary or Secondary IP not available or vradmind not running in the data replication Custom Resource status. (VIKE-2649)

Run the following steps

  1. Check the Last Refreshed status of this CR on the peer cluster.

  2. If it is not updated, run kubectl/oc exec -it <infoscale sds pod> -n infoscale-vtas -- bash to log in.

  3. Run /opt/VRTSvcs/bin/hagrp -offline <VIP service group> to take it offline.

  4. Check the Last Refreshed status of this CR on the peer cluster again after a few minutes. It is updated.

When InfoScale is installed on AKS, the InfoScale SDS pod does not come up after a reboot and remains in an 'ImagePullBackOff' state. (VIKE-2623)

Run the following steps.

  1. Check the kernel version on all AKS nodes that are a part of the InfoScale cluster.

  2. In case of a mismatch in the kernel version, reboot all nodes.

  3. After the reboot, if the mismatch persists, contact Microsoft support.

  4. If the new version is not an InfoScale supported version, contact Veritas support to enable support of this version.

On an AKS cluster, the disk goes into a detached state after a node reboot. (VIKE-2589)

Run the following steps.

  1. When the node is up after a reboot and joins the cluster, log in to the driver container pod running on that node.

  2. Identify the disk that is in a detached state.

  3. Run /etc/vx/bin/vxreattach <diskname>.

  4. Run vxdisk list to check whether the disk is attached.