NetBackup™ Snapshot Manager for Cloud Install and Upgrade Guide
- Introduction
- Section I. NetBackup Snapshot Manager for Cloud installation and configuration
- Preparing for NetBackup Snapshot Manager for Cloud installation
- Deploying NetBackup Snapshot Manager for Cloud using container images
- Deploying NetBackup Snapshot Manager for Cloud extensions
- Installing the NetBackup Snapshot Manager extension on a VM
- Installing the NetBackup Snapshot Manager extension on a managed Kubernetes cluster (AKS) in Azure
- Installing the NetBackup Snapshot Manager extension on a managed Kubernetes cluster (EKS) in AWS
- Installing the NetBackup Snapshot Manager extension on a managed Kubernetes cluster (GKE) in GCP
- NetBackup Snapshot Manager for cloud providers
- AWS plug-in configuration notes
- Google Cloud Platform plug-in configuration notes
- Prerequisites for configuring the GCP plug-in using Credential and Service Account option
- Microsoft Azure plug-in configuration notes
- Microsoft Azure Stack Hub plug-in configuration notes
- OCI plug-in configuration notes
- Configuration for protecting assets on cloud hosts/VM
- Protecting assets with NetBackup Snapshot Manager's on-host agent feature
- Installing and configuring NetBackup Snapshot Manager agent
- Configuring the NetBackup Snapshot Manager application plug-in
- Microsoft SQL plug-in
- Oracle plug-in
- Protecting assets with NetBackup Snapshot Manager's agentless feature
- Snapshot Manager for cloud catalog backup and recovery
- NetBackup Snapshot Manager for cloud assets protection
- Volume encryption in NetBackup Snapshot Manager for cloud
- NetBackup Snapshot Manager for Cloud security
- Section II. NetBackup Snapshot Manager for Cloud maintenance
- NetBackup Snapshot Manager for Cloud logging
- Upgrading NetBackup Snapshot Manager for Cloud
- Migrating and upgrading NetBackup Snapshot Manager
- Post-upgrade tasks
- Uninstalling NetBackup Snapshot Manager for Cloud
- Troubleshooting NetBackup Snapshot Manager for Cloud
Troubleshooting NetBackup Snapshot Manager
Refer to the following troubleshooting scenarios:
NetBackup Snapshot Manager agent fails to connect to the NetBackup Snapshot Manager server if the agent host is restarted abruptly.
This issue may occur if the host where the NetBackup Snapshot Manager agent is installed is shut down abruptly. Even after the host restarts successfully, the agent fails to establish a connection with the NetBackup Snapshot Manager server and goes into an offline state.
The agent log file contains the following error:
Flexsnap-agent-onhost[4972] mainthread flexsnap.connectors.rabbitmq: error - channel 1 closed unexpectedly: (405) resource_locked - cannot obtain exclusive access to locked queue 'flexsnap-agent.a1f2ac945cd844e393c9876f347bd817' in vhost '/'
This issue occurs because the RabbitMQ connection between the agent and the NetBackup Snapshot Manager server does not close even in case of an abrupt shutdown of the agent host. The NetBackup Snapshot Manager server cannot detect the unavailability of the agent until the agent host misses the heartbeat poll. The RabbitMQ connection remains open until the next heartbeat cycle. If the agent host reboots before the next heartbeat poll is triggered, the agent tries to establish a new connection with the NetBackup Snapshot Manager server. However, as the earlier RabbitMQ connection already exists, the new connection attempt fails with a resource locked error.
As a result of this connection failure, the agent goes offline, and all snapshot and restore operations performed on the host fail.
Workaround:
Restart the Cohesity NetBackup Snapshot Manager Agent service on the agent host.
On Linux hosts, run the following command:
# sudo systemctl restart flexsnap-agent.service
On Windows hosts:
Restart the Cohesity NetBackup Snapshot Manager™ Agent service from the Windows Services console.
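Alternatively, a minimal PowerShell sketch for restarting the service; the display-name filter is an assumption and may need to be adjusted to match the exact service name on your host:
Get-Service -DisplayName "*Snapshot Manager*Agent*" | Restart-Service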
NetBackup Snapshot Manager agent registration on Windows hosts may time out or fail.
To protect applications on Windows, you need to install and then register the NetBackup Snapshot Manager agent on the Windows host. The agent registration may sometimes take longer than usual and may either time out or fail.
Workaround:
To resolve this issue, try the following steps:
Re-register the agent on the Windows host using a fresh token.
If the registration process fails again, restart the NetBackup Snapshot Manager services on the NetBackup Snapshot Manager server and then try registering the agent again.
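If the services need to be restarted, a minimal sketch on the NetBackup Snapshot Manager server, using the flexsnap_configure commands referenced later in this guide:
flexsnap_configure stop
flexsnap_configure start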
Disaster recovery when DR package is lost or passphrase is lost.
This issue may occur if the DR package is lost or the passphrase is lost.
During a catalog backup, two backup packages are created:
DR package, which contains all the certificates
Catalog package, which contains the database
The DR package contains the NetBackup UUID certificates, and the catalog database also contains the UUID. When you perform disaster recovery using the DR package followed by catalog recovery, both the UUID certificate and the UUID are restored. This allows NetBackup to communicate with NetBackup Snapshot Manager because the UUID is unchanged.
However, if the DR package or the passphrase is lost, the DR operation cannot be completed, and you can only recover the catalog without the DR package after you reinstall NetBackup. In this case, a new UUID is created for NetBackup that NetBackup Snapshot Manager does not recognize, so the one-to-one mapping between NetBackup and NetBackup Snapshot Manager is lost.
Workaround:
To resolve this issue, you must update the new NetBackup UUID and version number after the NetBackup primary server is created.
The NetBackup administrator must be logged on to the NetBackup Web Management Service to perform this task. Use the following command to log on:
/usr/openv/netbackup/bin/bpnbat -login -loginType WEB
Execute the following command on the primary server to get the NBU UUID:
/usr/openv/netbackup/bin/admincmd/nbhostmgmt -list -host <primary server host name> | grep "Host ID"
Execute the following command to get the Version Number:
/usr/openv/netbackup/bin/admincmd/bpgetconfig -g <primary server host name> -L
After you get the NBU UUID and Version number, execute the following command on the NetBackup Snapshot Manager host to update the mapping:
/cloudpoint/scripts/cp_update_nbuuid.sh -i <NBU UUID> -v <Version Number>
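A worked sketch of the full sequence, using hypothetical values for the primary server name, UUID, and version number:
# On the primary server: log on and collect the UUID and version (primary.example.com is a hypothetical host name)
/usr/openv/netbackup/bin/bpnbat -login -loginType WEB
/usr/openv/netbackup/bin/admincmd/nbhostmgmt -list -host primary.example.com | grep "Host ID"
/usr/openv/netbackup/bin/admincmd/bpgetconfig -g primary.example.com -L
# On the NetBackup Snapshot Manager host: update the mapping (the UUID and version below are placeholders)
/cloudpoint/scripts/cp_update_nbuuid.sh -i 3f0243d2-8f57-4a7f-b6b2-9a1250102030 -v 10.3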
The snapshot job is successful but the backup job fails with the error "The NetBackup Snapshot Managers certificate is not valid or doesn't exist.(9866)" when ECA_CRL_CHECK is disabled on the master server.
If ECA_CRL_CHECK is configured on the master server and is disabled, then it must be configured with the same value in the bp.conf file on the NetBackup Snapshot Manager setup. For example, consider a backup from snapshot scenario where NetBackup is configured with an external certificate and the certificate is revoked. In this case, if ECA_CRL_CHECK is set to DISABLE on the master server, set the same value in the bp.conf file of the NetBackup Snapshot Manager setup; otherwise, the snapshot operation succeeds but the backup operation fails with the certificate error.
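A minimal sketch for checking the value on the master server and mirroring it on the NetBackup Snapshot Manager setup; the bp.conf path on the Snapshot Manager host is an assumption and may differ in your deployment:
# On the master server: check the configured value
/usr/openv/netbackup/bin/nbgetconfig | grep ECA_CRL_CHECK
# On the NetBackup Snapshot Manager host: set the same value (path is an assumption)
echo "ECA_CRL_CHECK = DISABLE" >> /usr/openv/netbackup/bp.conf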
NetBackup Snapshot Manager cloud operations fail on a RHEL system if a firewall is disabled
The NetBackup Snapshot Manager operations fail for all the supported cloud plug-ins on a RHEL system if a firewall is disabled on that system while the NetBackup Snapshot Manager services are running. This is a network configuration issue that prevents NetBackup Snapshot Manager from accessing the cloud provider REST API endpoints.
Workaround:
Stop NetBackup Snapshot Manager
flexsnap_configure stop
Restart Docker
# systemctl restart docker
Restart NetBackup Snapshot Manager
flexsnap_configure start
Backup from Snapshot and Indexing jobs fail with the following errors:
Jun 10, 2021 2:17:48 PM - Error mqclient (pid=1054) SSL Connection failed with string, broker:<hostname>
Jun 10, 2021 2:17:48 PM - Error mqclient (pid=1054) Failed SSL handshake, broker:<hostname>
Jun 10, 2021 2:19:16 PM - Error nbcs (pid=29079) Invalid operation for asset: <asset_id>
Jun 10, 2021 2:19:16 PM - Error nbcs (pid=29079) Acknowledgement not received for datamover <datamover_id>
and/or
Jun 10, 2021 3:06:13 PM - Critical bpbrm (pid=32373) from client <asset_id>: FTL - Cannot retrieve the exported snapshot details for the disk with UUID:<disk_asset_id>
Jun 10, 2021 3:06:13 PM - Info bptm (pid=32582) waited for full buffer 1 times, delayed 220 times
Jun 10, 2021 3:06:13 PM - Critical bpbrm (pid=32373) from client <asset_id>: FTL - cleanup() failed, status 6
This can happen when inbound access to NetBackup Snapshot Manager on ports 5671 and 443 is blocked at the OS firewall level (firewalld). As a result, communication to NetBackup Snapshot Manager from the datamover container (used for the Backup from Snapshot and Indexing jobs) is blocked, and the datamover container cannot start the backup or indexing operation.
Workaround:
Modify the OS firewall rules to allow inbound connections on ports 5671 and 443.
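A hedged firewalld sketch for opening both ports permanently (the default zone is assumed):
firewall-cmd --permanent --add-port=5671/tcp
firewall-cmd --permanent --add-port=443/tcp
firewall-cmd --reload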
Agentless connection fails for a VM with an error message.
Agentless connection fails for a VM with the following error message when the user changes the authentication type from SSH key-based to password-based for a VM through the portal:
User does not have the required privileges to establish an agentless connection
This issue occurs when the permissions are not defined correctly for the user in the sudoers file as mentioned in the above error message.
Workaround:
Resolve the sudoers file issue for the user by providing the required permissions to perform the passwordless sudo operations.
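A minimal sketch of a sudoers entry that grants passwordless sudo, assuming the agentless connection user is named backupuser (a hypothetical name):
# Contents of /etc/sudoers.d/backupuser (backupuser is a hypothetical user name)
backupuser ALL=(ALL) NOPASSWD:ALL
# Validate the syntax before use:
visudo -cf /etc/sudoers.d/backupuser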
When NetBackup Snapshot Manager is deployed in a private subnet (without internet access), NetBackup Snapshot Manager functions fail
This issue occurs when NetBackup Snapshot Manager is deployed in a private network where the firewall is enabled or the public IP address is disabled, because the customer's information security team does not allow full internet access to the virtual machines.
Workaround:
Enable the ports from the firewall command line using the following commands:
firewall-cmd --add-port=22/tcp
firewall-cmd --add-port=5671/tcp
firewall-cmd --add-port=443/tcp
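The commands above change the runtime configuration only. A sketch for persisting the rules across reboots using standard firewalld options:
firewall-cmd --runtime-to-permanent
firewall-cmd --list-ports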
Restoring asset from backup copy fails
In some scenarios it is observed that the connection resets intermittently in the Docker container. Due to this, the server sends more TCP payload than the advertised client window. Sometimes the Docker container drops packets from a new TCP connection handshake. To allow these packets, use the nf_conntrack_tcp_be_liberal option.
If nf_conntrack_tcp_be_liberal = 1, then the following packets are allowed:
ACK is under the lower bound (possible overly delayed ACK)
ACK is over the upper bound (ACKed data not seen yet)
SEQ is under the lower bound (already ACKed data retransmitted)
SEQ is over the upper bound (over the window of the receiver)
If nf_conntrack_tcp_be_liberal = 0, then those packets are also rejected as invalid.
Workaround:
To resolve the issue of restore from backup copy, use the nf_conntrack_tcp_be_liberal = 1 option and set this value on the node where the datamover container is running.
Use the following command to set the value of nf_conntrack_tcp_be_liberal:
sysctl -w net.netfilter.nf_conntrack_tcp_be_liberal=1
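The sysctl -w setting does not persist across reboots. A minimal sketch for making it persistent on the datamover node, assuming a standard sysctl.d layout (the file name is hypothetical):
# Contents of /etc/sysctl.d/90-conntrack-liberal.conf
net.netfilter.nf_conntrack_tcp_be_liberal = 1
# Apply the setting:
sysctl --system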
Some pods on Kubernetes extension progressed to completed state
Workaround:
Disable Kubernetes extension.
Delete listener pod using the following command:
# kubectl delete pod flexsnap-listener-xxxxx -n <namespace>
Enable Kubernetes extension.
User is not able to customize a cloud protection plan
Workaround:
Create a new protection plan with the desired configuration and assign it to the asset.
Default timeout of 6 hours does not allow restore of larger databases (size more than 300 GB)
Workaround:
A configurable timeout parameter value can be set to restore a larger database. The timeout value can be specified in the /etc/flexsnap.conf file of the flexsnap-coordinator container. It does not require a restart of the coordinator container. The timeout value is picked up in the next database restore job.
The user must specify the timeout value in seconds as follows:
docker exec -it flexsnap-coordinator bash
root@flexsnap-coordinator:/# cat /etc/flexsnap.conf
[global]
target = flexsnap-rabbitmq
grt_timeout = 39600
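A minimal sketch for appending the setting from the host, assuming the coordinator container is named flexsnap-coordinator and that [global] is the last section in /etc/flexsnap.conf (39600 seconds corresponds to 11 hours):
docker exec flexsnap-coordinator sh -c 'printf "grt_timeout = 39600\n" >> /etc/flexsnap.conf'
docker exec flexsnap-coordinator cat /etc/flexsnap.conf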
Agentless connection and granular restore to restored host fails when the VM restored from backup has 50 tags attached to it
Workaround:
(For AWS) If a Windows VM restored from backup has 50 tags and the platform tag does not exist, the user can remove any tag that is not required and add the platform tag.
For a few GKE versions, failed pod issues are observed in the namespace
The following failed pods in the namespace are observed with the failure status NodeAffinity:
$ kubectl get pods -n <cp_extension_namespace>
NAME                                                        READY   STATUS        RESTARTS      AGE
flexsnap-datamover-2fc2967943ba4ded8ef653318107f49c-664tm   0/1     Terminating   0             4d14h
flexsnap-fluentd-collector-c88f8449c-5jkqh                  0/1     NodeAffinity  0             3d15h
flexsnap-fluentd-collector-c88f8449c-ph8mx                  0/1     NodeAffinity  0             39h
flexsnap-fluentd-collector-c88f8449c-rqw7w                  1/1     Running       0             10h
flexsnap-fluentd-collector-c88f8449c-sswzr                  0/1     NodeAffinity  0             5d18h
flexsnap-fluentd-ftlnv                                      1/1     Running       3 (10h ago)   10h
flexsnap-listener-84c66dd4b8-6l4zj                          1/1     Running       0             10h
flexsnap-listener-84c66dd4b8-ls4nb                          0/1     NodeAffinity  0             17h
flexsnap-listener-84c66dd4b8-x84q8                          0/1     NodeAffinity  0             3d15h
flexsnap-listener-84c66dd4b8-z7d5m                          0/1     NodeAffinity  0             5d18h
flexsnap-operator-6b7dd6c56c-cf4pc                          1/1     Running       0             10h
flexsnap-operator-6b7dd6c56c-qjsbs                          0/1     NodeAffinity  0             5d18h
flexsnap-operator-6b7dd6c56c-xcsgj                          0/1     NodeAffinity  0             3d15h
flexsnap-operator-6b7dd6c56c-z86tc                          0/1     NodeAffinity  0             39h
However, these failures do not affect the functionality of NetBackup Snapshot Manager Kubernetes extension.
Workaround:
Manually clean up the failed pods using the following command:
kubectl get pods -n <cp_extension_namespace> | grep NodeAffinity | awk '{print $1}' | xargs kubectl delete pod -n <cp_extension_namespace>
Plugin information is duplicated if NetBackup Snapshot Manager registration has failed in previous attempts
This occurs only when NetBackup Snapshot Manager has been deployed using the MarketPlace deployment mechanism. This issue is observed when the plugin information is added before the registration. This issue creates duplicate plugin information in the file that stores the plugin information.
Workaround:
Manually delete the duplicated plugin information from the file.
For example, consider the following example where the duplicate entry for the GCP plugin config is visible in the file:
{
    "CPServer1": [
        {
            "Plugin_ID": "test",
            "Plugin_Type": "aws",
            "Config_ID": "aws.8dda1bf5-5ead-4d05-912a-71bdc13f55c4",
            "Plugin_Category": "Cloud",
            "Disabled": false
        }
    ]
},
{
    "CPServer2": [
        {
            "Plugin_ID": "gcp.2080179d-c149-498a-bf1f-4c9d9a76d4dd",
            "Plugin_Type": "gcp",
            "Config_ID": "gcp.2080179d-c149-498a-bf1f-4c9d9a76d4dd",
            "Plugin_Category": "Cloud",
            "Disabled": false
        },
    ]
}
Plugin information is duplicated if a cloned NetBackup Snapshot Manager is added into NetBackup
This occurs only when a cloned NetBackup Snapshot Manager is added into NetBackup during migration of NetBackup Snapshot Manager to a RHEL 8.6 VM. Cloning of NetBackup Snapshot Manager uses the existing NetBackup Snapshot Manager volume to create a new NetBackup Snapshot Manager. This creates a duplicate entry in the file that stores the plugin information.
Workaround:
Manually edit and delete the duplicated plugin information from the file.
For example, consider the following example where the duplicate entry for the Azure plugin config is visible in the file:
{
},
{
    "cpserver101.yogesh.joshi2-dns-zone": [
        {
            "Plugin_ID": "azure.327ec7fc-7a2d-4e94-90a4-02769a2ba521",
            "Plugin_Type": "azure",
            "Config_ID": "azure.327ec7fc-7a2d-4e94-90a4-02769a2ba521",
            "Plugin_Category": "Cloud",
            "Disabled": false
        },
        {
            "Plugin_ID": "AZURE_PLUGIN",
            "Plugin_Type": "azure",
            "Config_ID": "azure.4400a00a-8d2b-4985-854a-74f48cd4567e",
            "Plugin_Category": "Cloud",
            "Disabled": false
        }
    ]
}
]
}
Backup from Snapshot operation using Snapshot Manager version 10.0 deployed in Azure fails due to SSL cert error
Backup from Snapshot operation using Snapshot Manager version 10.3 or later deployed in Azure fails due to SSL cert error related to CRL (curl).
Workaround:
Add ECA_CRL_CHECK = 0 in the Snapshot Manager bp.conf file and ensure that Azure endpoints are accessible from the media server.
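A minimal sketch, assuming the bp.conf path on the Snapshot Manager host (adjust as needed) and using a representative Azure management endpoint for the reachability check:
# On the Snapshot Manager host: disable CRL checking (path is an assumption)
echo "ECA_CRL_CHECK = 0" >> /usr/openv/netbackup/bp.conf
# From the media server: confirm that the Azure endpoint is reachable
curl -v https://management.azure.com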