NBU status: 2106, EMM status: Storage Server is down or unavailable, new jobs on all media servers fail
Problem:
The NetBackup Enterprise Media Manager (EMM) process on the master server is completely unresponsive, whether queried from the NetBackup Java Administration Console, from the command line via nbemmcmd, or from any other NetBackup process.
Active backup jobs and new backup jobs may fail with status 2106.
Backup jobs that went active in the seconds just before the status 2106 condition began will fail with status 47.
Error Message:
NetBackup status: 2106, EMM status: Storage Server is down or unavailable
NetBackup status: 47, host is unreachable
Cause:
A condition exists in NetBackup where the NetBackup Remote Manager and Monitor Service (nbrmms) on any media server can open enough connections to the nbemm process on the master server to prevent nbemm from accepting new connections. This happens if the NetBackup Legacy Network Service (vnetd) is stopped, or fails, on the media server. During this condition, all storage servers may be marked down and command line communication (nbemmcmd) may hang or stall. All active backup jobs may fail, as will any jobs that retry while the condition persists.
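One way to spot this connection buildup is to count established connections held by nbemm per remote host. The sketch below is a hypothetical helper that parses `netstat -naopt`-style output (column layout assumed from the example later in this article); in practice you would pipe live netstat output into it on the master server.

```shell
# Hypothetical helper: count ESTABLISHED connections to the nbemm port (1556)
# per remote peer host. Column layout assumed from 'netstat -naopt' output:
# Proto Recv-Q Send-Q Local-Address Foreign-Address State PID/Program ...
count_nbemm_conns() {
  awk '$6 == "ESTABLISHED" && $4 ~ /:1556$/ {
         split($5, peer, ":")      # peer[1] = remote IP, peer[2] = remote port
         counts[peer[1]]++
       }
       END { for (h in counts) print counts[h], h }'
}

# Sample lines standing in for: netstat -naopt | grep nbemm
count_nbemm_conns <<'EOF'
tcp 0 0 192.168.0.15:1556 192.168.0.12:35744 ESTABLISHED 15328/nbemm
tcp 0 0 192.168.0.15:1556 192.168.0.12:37890 ESTABLISHED 15328/nbemm
tcp 776 0 192.168.0.15:1556 192.168.1.11:45261 ESTABLISHED 15328/nbemm
EOF
```

A single remote media server accounting for dozens of nbemm connections is consistent with the condition described above.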
Another form of the problem can occur even when the vnetd service is up and running. If a job is canceled during startup of the bptm or bpdm process on the media server, the initialization of the EMM ORB can be interrupted by the termination signal (1=SIGHUP), and connections will not be handed off to the vnetd proxy process. After 120 seconds of repeated connection retries, the process fails with either status 47 or status 83.
...snip...
01:38:33.577 [204449] <2> Orb::init: initializing ORB EMMlib_Orb with:
...snip...
01:38:34.075 [204449] <2> Orb::setOrbConnectTimeout: timeout seconds: 60(Orb.cpp:1670)
01:38:34.075 [204449] <2> Orb::setOrbRequestTimeout: timeout seconds: 1800(Orb.cpp:1679)
01:38:34.079 [204449] <2> Media_library_signal_poll: 1:Terminate detected
01:38:34.080 [204449] <2> vnet_proxy_helper_connect: Termination callback function indicates we are shutting down
01:38:34.080 [204449] <16> vnet_proxy_socket_swap: vnet_proxy_helper_connect() failed: 48
01:38:34.080 [204449] <2> ProxyConnector::connectToProxy(): vnet_proxy_socket_swap failed, retval: 48(ProxyConnector.cpp:125)
01:38:34.081 [204449] <2> Media_library_signal_poll: 1:Terminate detected
01:38:34.081 [204449] <2> vnet_proxy_helper_connect: Termination callback function indicates we are shutting down
01:38:34.081 [204449] <16> vnet_proxy_socket_swap: vnet_proxy_helper_connect() failed: 48
01:38:34.081 [204449] <2> ProxyConnector::connectToProxy(): vnet_proxy_socket_swap failed, retval: 48(ProxyConnector.cpp:125)
01:38:35.084 [204449] <2> Media_library_signal_poll: 1:Terminate detected
...snip...
01:38:35.086 [204449] <2> Media_library_signal_poll: 1:Terminate detected
...snip...
01:38:37.095 [204449] <2> Media_library_signal_poll: 1:Terminate detected
...snipped repeated connection retries every 2 seconds until retries are exhausted after 2 minutes...
01:40:36.406 [204449] <2> Media_library_signal_poll: 1:Terminate detected
01:40:36.406 [204449] <2> vnet_proxy_helper_connect: Termination callback function indicates we are shutting down
01:40:36.406 [204449] <16> vnet_proxy_socket_swap: vnet_proxy_helper_connect() failed: 48
01:40:36.406 [204449] <2> ProxyConnector::connectToProxy(): vnet_proxy_socket_swap failed, retval: 48(ProxyConnector.cpp:125)
01:40:37.408 [204449] <8> Orb::connectToObjectRetries: Object was never initialized before the max timeout
01:40:37.408 [204449] <16> emmlib_initializeEx: (-) Exception! CORBA::TRANSIENT
...snip...
01:40:42.715 [204449] <2> bptm: EXITING with status 47 <----------
or
xx:xx:xx.xxx [xxxxxx] <2> bpdm: EXITING with status 83
This will result in up to 60 connections from the NB_8.1+ media server to nbemm on the master server. These are inter-host connections, but show no data in the TCP Snd-Q and Rcv-Q queues on the master server until bptm or bpdm exits. Connections that arrive during that two-minute delay from other NB_8.1+ media servers, where vnetd is up and can be contacted by nbrmms/bptm/bpdm/etc., will show inbound data waiting for ingest. E.g.
$ netstat -naopt | grep nbemm | egrep -v ' 127.0.0.1:| 192.168.0.15:.* 192.168.0.15:'
Proto Recv-Q Send-Q Local Address Foreign Address State PID/Program name Timer
Mon Dec 17 01:38:48 CST 2018
tcp 0 0 192.168.0.15:1556 192.168.0.12:35744 ESTABLISHED 15328/nbemm keepalive (7060.24/0/0)
tcp 0 0 192.168.0.15:1556 192.168.0.12:37890 ESTABLISHED 15328/nbemm keepalive (7089.61/0/0)
tcp 0 0 192.168.0.15:1556 192.168.0.12:29447 ESTABLISHED 15328/nbemm keepalive (7156.84/0/0)
...snipped additional connections...
Mon Dec 17 01:40:18 CST 2018
tcp 0 0 192.168.0.15:1556 192.168.0.12:29447 ESTABLISHED 15328/nbemm keepalive (7092.42/0/0)
tcp 0 0 192.168.0.15:1556 192.168.0.12:23457 ESTABLISHED 15328/nbemm keepalive (7102.73/0/0)
tcp 776 0 192.168.0.15:1556 192.168.1.11:45261 ESTABLISHED 15328/nbemm keepalive (7119.01/0/0)
tcp 0 0 192.168.0.15:1556 192.168.0.12:19284 ESTABLISHED 15328/nbemm keepalive (7165.32/0/0)
tcp 1154 0 192.168.0.15:1556 192.168.0.12:26384 ESTABLISHED 15328/nbemm keepalive (7184.86/0/0)
...snipped additional connections...
Mon Dec 17 01:40:48 CST 2018
Mon Dec 17 01:41:18 CST 2018
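The Recv-Q column in the output above is the quickest tell: a nonzero value on an nbemm connection means inbound data is waiting for nbemm to ingest it. The following is a hypothetical filter (column positions assumed from the `netstat -naopt` output shown above) that surfaces only those stuck connections.

```shell
# Hypothetical filter: print nbemm connections with inbound data stuck in the
# receive queue (Recv-Q > 0). Recv-Q is column 2 of 'netstat -naopt' output.
stuck_recvq() {
  awk '$1 == "tcp" && $2 + 0 > 0 { print $2 " bytes queued from " $5 }'
}

# Sample lines standing in for: netstat -naopt | grep nbemm
stuck_recvq <<'EOF'
tcp 0 0 192.168.0.15:1556 192.168.0.12:29447 ESTABLISHED 15328/nbemm
tcp 776 0 192.168.0.15:1556 192.168.1.11:45261 ESTABLISHED 15328/nbemm
tcp 1154 0 192.168.0.15:1556 192.168.0.12:26384 ESTABLISHED 15328/nbemm
EOF
```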
Work Around:
Ensure that all processes are either stopped or fully running on the media servers, especially vnetd and nbrmms. Either start vnetd, or shut down nbrmms and all of NetBackup.
Use "bpclntcmd -get_remote_host_version <hostname>" to check communication to the media server.
Example:
C:\Program Files\Veritas\NetBackup\bin>bpclntcmd -get_remote_host_version media-server
8.1
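The single-host check above can be wrapped in a loop to sweep all media servers at once. This is a hypothetical sketch: the bpclntcmd path is an assumption (adjust it for your platform; Windows installs differ), and only the `-get_remote_host_version` option documented above is used.

```shell
# Hypothetical sweep: probe each media server with
# 'bpclntcmd -get_remote_host_version' and flag any that do not answer,
# which suggests vnetd may be down there. BPCLNTCMD path is an assumption.
BPCLNTCMD="${BPCLNTCMD:-/usr/openv/netbackup/bin/bpclntcmd}"

check_media_servers() {
  for host in "$@"; do
    if ver="$("$BPCLNTCMD" -get_remote_host_version "$host" 2>/dev/null)" &&
       [ -n "$ver" ]; then
      echo "OK   $host $ver"
    else
      echo "FAIL $host - check vnetd and nbrmms on this host"
    fi
  done
}

# Usage: check_media_servers media-server-1 media-server-2
```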
When successful, the bpclntcmd debug log on the master server will show a successful connection to bprd for a C_REMOTE_HOST_VERSION request. The bprd debug log will show connections to the remote host: first to bpcd via PBX using the vnetd proxy, and then to vnetd via PBX.
C:\Program Files\Veritas\NetBackup\logs\bpclntcmd\ALL_ADMINS.102318_00001.log
10:15:00.253 [10164.8104] <2> logparams: -get_remote_host_version media-server
...snip...
10:15:00.253 [10164.8104] <2> logconnections: BPRD CONNECT FROM 192.168.2.100.63542 TO 192.168.2.100.13720 fd = 472
C:\Program Files\Veritas\NetBackup\logs\bprd\ALL_ADMINS.102318_00001.log
10:15:00.534 [7040.9936] <2> daemon_proxy_proto: Preparing to do daemon protocol for (192.168.2.100:13720 <- 192.168.2.100:63542)
10:15:00.534 [7040.9936] <2> logconnections: BPRD ACCEPT FROM 192.168.2.100.63542 TO 192.168.2.100.13720 fd = 864
...snip...
10:15:00.534 [7040.9936] <2> process_request: command C_REMOTE_HOST_VERSION (43) received
10:15:00.534 [7040.9936] <2> bpcd_remote_host_version_cmd: from client master-server: DestHost:media-server
10:15:00.534 [7040.9936] <2> vnet_pbxConnect_ex: pbxConnectExEx Succeeded
...snip...
10:15:00.565 [7040.9936] <2> logconnections: PROXY CONNECT FROM 192.168.2.100.63543 TO 192.168.2.102.1556 fd = 752
10:15:00.565 [7040.9936] <2> logconnections: BPCD CONNECT FROM 127.0.0.1.63545 TO 127.0.0.1.63546 fd = 752
...snip...
10:15:00.565 [7040.9936] <2> vnet_pbxConnect_ex: pbxConnectExEx Succeeded
...snip...
10:15:00.627 [7040.9936] <8> do_pbx_service: [vnet_connect.c:2581] via PBX VNETD CONNECT FROM 192.168.2.100.63547 TO 192.168.2.102.1556 fd = 780
10:15:00.627 [7040.9936] <8> vnet_vnetd_connect_forward_socket_begin: [vnet_vnetd.c:458] VN_REQUEST_CONNECT_FORWARD_SOCKET 10 0xa
10:15:00.627 [7040.9936] <8> vnet_vnetd_connect_forward_socket_begin: [vnet_vnetd.c:483] ipc_string /usr/openv/var/tmp/vnet-22203540307700635927000000003-dJXd0O
10:15:00.674 [7040.9936] <2> bpcr_get_version_rqst: bpcd version: 08100000
10:15:00.674 [7040.9936] <2> bpcd_remote_host_version_cmd: CLIENT_CMD_SOCK from bpcr = 752
10:15:00.674 [7040.9936] <2> bpcd_remote_host_version_cmd: CLIENT_DATA_SOCK from bpcr = 780
When vnetd is down on the remote media server, the command fails. The bprd log shows the first connection to PBX, but then an error during vnet_proxy_socket_swap (because vnetd is not running), and the second connection is never attempted.
Example:
C:\Program Files\Veritas\NetBackup\bin>bpclntcmd -get_remote_host_version media-server
socket read failed
C:\Program Files\Veritas\NetBackup\logs\bpclntcmd\ALL_ADMINS.102318_00001.log
10:19:21.596 [9568.7264] <2> logparams: -get_remote_host_version media-server
10:19:21.627 [9568.7264] <2> logconnections: BPRD CONNECT FROM 192.168.2.100.63662 TO 192.168.2.100.13720 fd = 468
C:\Program Files\Veritas\NetBackup\logs\bprd\ALL_ADMINS.102318_00001.log
10:19:21.877 [4428.9464] <2> logconnections: BPRD ACCEPT FROM 192.168.2.100.63662 TO 192.168.2.100.13720 fd = 864
...snip...
10:19:21.877 [4428.9464] <2> process_request: command C_REMOTE_HOST_VERSION (43) received
...snip...
10:19:21.877 [4428.9464] <2> vnet_pbxConnect_ex: pbxConnectExEx Succeeded
...snip...
10:19:21.893 [4428.9464] <16> dump_proxy_info: statusmsg: A socket read failed. Status: -x Errno: 10053 Msg: Unknown error., nbu status = 23, severity = 2, Additional Message: [PROXY] Encountered error (CERT_PROTOCOL_READING_JSON_LENGTH) while processing(CertProtocol)., nbu status = 2, severity = 1
10:19:21.893 [4428.9464] <16> connect_to_service: vnet_proxy_socket_swap(NULL, -1, 310, 0, media-server, bpcd, 1) failed: 9
10:19:21.893 [4428.9464] <8> vnet_connect_to_bpcd: [vnet_connect.c:565] connect_to_service() failed 9 0x9
10:19:21.893 [4428.9464] <16> local_bpcr_connect: vnet_connect_to_bpcd(media-server) failed: 9
10:19:21.893 [4428.9464] <2> local_bpcr_connect: Can't connect to client media-server
10:19:21.893 [4428.9464] <2> ConnectToBPCD: bpcd_connect_and_verify(media-server, media-server) failed: 23
10:19:21.893 [4428.9464] <2> bpcd_remote_host_version_cmd: BPCD on client media-server exited with error, socket read failed
10:19:21.893 [4428.9464] <2> put_short: (10) network write() error: An operation was attempted on something that is not a socket. ; socket = -1
10:19:21.893 [4428.9464] <2> bpcr_disconnect_rqst: bpcr protocol error - couldn't send request type
10:19:21.893 [4428.9464] <2> process_request: bpcd_remote_host_version_cmd failed - status = socket read failed (23)
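To gauge how often the condition is occurring, the bprd debug logs can be scanned for the swap-failure signature shown above. This is a hypothetical sketch; the log directory is an assumption (Windows masters log under the install path's NetBackup\logs\bprd instead).

```shell
# Hypothetical helper: count vnet_proxy_socket_swap failures in a debug log
# directory, e.g. /usr/openv/netbackup/logs/bprd (path is an assumption).
count_swap_failures() {
  grep -h 'vnet_proxy_socket_swap.*failed' "$1"/*.log 2>/dev/null | wc -l
}

# Usage: count_swap_failures /usr/openv/netbackup/logs/bprd
```

The same pattern also matches the bptm/bpdm-side retry failures shown in the Cause section, so it can be run against media server logs as well.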
If vnetd is running on the remote media servers, the failure or cancellation of jobs will allow nbemm to catch up with the inbound connection load. Once that happens, jobs will start succeeding again. To prevent the intermittent failures, apply the hotfix noted below.
Solution:
The issue has been addressed in the version(s) of the product specified at the end of this article. If you cannot upgrade at this time, a hotfix is available for previous versions of the product, as noted below.
A supported hotfix (ET3957010) has been made available for this issue. Please contact Veritas Technical Support to obtain this fix. This hotfix has not yet gone through extensive QA testing. Consequently, if you are not adversely affected by this problem and have a satisfactory temporary workaround in place, we recommend that you wait for the public release of this hotfix.
Veritas Technologies LLC currently plans to address this issue by way of a patch or hotfix to the current version of the software. Please note that Veritas Technologies LLC reserves the right to remove any fix from the targeted release if it does not pass quality assurance tests. Veritas’ plans are subject to change and any action taken by you based on the above information or your reliance upon the above information is made at your own risk.
Please contact your Veritas Sales representative or the Veritas Sales group for upgrade information including upgrade eligibility to the release containing the resolution for this issue.
Note: A recent issue was observed, and fixed, with an NB 8.1.2 clustered master server in a mixed environment of 8.x and 7.x media servers.
Applies To
- Many jobs fail with status 2106 or status 47 when nbemm is unable to accept connections in a timely fashion due to vnet_proxy_socket_swap failures on a remote media server (addressed by the ET 3957010 fix).
Binaries are available for several versions:
NBU Version | Etrack number
--- | ---
8.1 | 3957010
8.1.1 | 3966404
8.1.2 | 3973044**

** This is an EEB bundle.
- This EEB needs to be applied on the media servers.
- This can affect any multi-threaded or single-threaded process that connects to nbemm. It can happen when vnetd is down, or if vnet_proxy_socket_swap encounters errors.
Error conditions include TCP connection drops, TCP retransmissions that exceed the swap timeout, the process receiving a terminate signal before or during the swap, etc. The condition is present if nbemm is slow to respond to connection attempts and 'netstat -naopt' (Linux) or 'netstat -nau' (Solaris 11) shows inter-host connections held by nbemm where the remote host is NB_8.1+. lsof, pfiles, and similar utilities can show the same information.
Alternatively, on any platform, verify by checking for delays between the PBX accept of a connection and the nbemm proxy swap of the same source IP/port; the nbemm log must have 137.DebugLevel=6 set to show the IP/port pairs.
Fixed in:
This issue is fixed in the following release(s), available in the Download Center at https://www.veritas.com/support/en_US/downloads
NetBackup 8.2 and newer versions.