Connection establishment fails during SSL shutdown close notify processing

Article: 100047973
Last Published: 2020-12-15
Product(s): Appliances, NetBackup & Alta Data Protection

Problem

By design, some NetBackup processes establish a TLS session after connecting to each other to validate the connecting and accepting hosts, but then terminate the TLS session before continuing communication.  In some situations, it has been observed that the SSL shutdown close notify processing does not complete cleanly and the connection is closed by one end before further communication can occur.
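
For context, a clean TLS termination is a two-message close notify exchange: each side sends its own close_notify alert and then waits for the peer's.  In OpenSSL terms, which the nbpxyhelper log messages below echo, that is two calls to SSL_shutdown().  A minimal sketch of the idiom follows; it is illustrative only, not NetBackup's actual code, and assumes a blocking socket.

#include <openssl/ssl.h>

/* Bidirectional TLS shutdown.  The first SSL_shutdown() sends our
 * close_notify; a return of 0 means the peer's close_notify is still
 * pending, so a second call waits for it. */
int tls_shutdown(SSL *ssl)
{
    int rc = SSL_shutdown(ssl);
    if (rc == 0)
        rc = SSL_shutdown(ssl);   /* wait for the peer close notify */
    return rc;                    /* 1 = complete, <0 = see SSL_get_error() */
}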

The problem has been reported only on master servers, but three scenarios are possible.  The following types of connections can potentially be affected.

  1. bpbrm (on media server) to bpdbm (on master server) connection to update the image metadata, causing the backup to fail.
  2. CLI (by non-root user) to service (both on master server) connection when NetBackup Access Control (NBAC) is enabled, preventing the CLI from working.
  3. bpbrm or bptm (on media server) to vnetd (on client) connections to transport a backup image or metadata, causing the backup to fail.

The issue is intermittent, and may lessen or disappear entirely when the debug level of the nbpxyhelper log is increased.  On some hosts, setting 486.DebugLevel=1 has completely avoided the condition; on others, a value of 4 has only reduced the frequency of occurrence.  Some hosts see no change when adjusting the debug level.

The most common symptom is a backup job failing with status 233 during bpbrm db_FLISTsend connection establishment, but other types of connection failures can produce the same status; check the debug logs to be certain.  Many other external symptoms are possible, but the distinguishing details appear only in the debug logs from the affected processes.

Log Messages

Example 1 - nbpxyhelper SSL shutdown close notify

Three symptoms are present in the nbpxyhelper (OID 486) debug logs from the connecting and accepting hosts during TRANSPORT_SHUTDOWN_PROTOCOL_HANDSHAKE.

The debug log on one of the hosts will show the first symptom: the 'waiting for peer close notify' step fails with 'shutdown while in init'.

16:55:54.562 [Debug] ... 4 [TransportShutdownProtocol::setState] Transitioning state from TRANSPORT_SHUTDOWN_PROTOCOL_NOT_STARTED to TRANSPORT_SHUTDOWN_PROTOCOL_HANDSHAKE 
16:55:54.562 [Debug] ... 6 [SSLSocket::shutdown] SSL shutdown close notify completed, waiting for peer close notify 
16:55:54.562 [Debug] ... 1 [SSLSocket::shutdown] operation returned unhandled error 1 
16:55:54.563 [Debug] ... 1 [SSLSocket::shutdown] SSL connection was broken. Status: Msg: Error 336462231: error:140E0197:SSL routines:SSL_shutdown:shutdown while in init
16:55:54.563 [Debug] ... 1 [TransportShutdownTask::runTask] Error: Transport shutdown failed with socket error: SocketError:
16:55:54.563 [Debug] ... 1 [TransportShutdownProtocol::asyncIoTaskComplete] Error: Task completed w/an error.
16:55:54.563 [Debug] ... 1 [TransportShutdownProtocol::asyncIoTaskComplete] Transport Shutdown Task:
16:55:54.563 [Debug] ... 1 [TransportShutdownProtocol::asyncIoTaskComplete] SSL connection was broken. Status: Msg: Error 336462231: error:140E0197:SSL routines:SSL_shutdown:shutdown while in init
16:55:54.563 [Debug] ... 4 [TransportShutdownProtocol::setState] Transitioning state from TRANSPORT_SHUTDOWN_PROTOCOL_HANDSHAKE to TRANSPORT_SHUTDOWN_PROTOCOL_ERROR 
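
For reference, the packed code in that message can be decoded with OpenSSL's own error-string routine.  The standalone check below assumes an OpenSSL 1.x build, which produces the string format shown above (on 1.0.x, SSL_load_error_strings() must be called first; 1.1.0 and later load the strings automatically).

#include <stdio.h>
#include <openssl/err.h>

int main(void)
{
    char buf[256];
    /* 336462231 decimal == 0x140E0197, the code logged above */
    ERR_error_string_n(0x140E0197UL, buf, sizeof(buf));
    /* prints: error:140E0197:SSL routines:SSL_shutdown:shutdown while in init */
    printf("%s\n", buf);
    return 0;
}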


The debug log on the other host will show the other two symptoms: that the SSL shutdown close notify was completed by both hosts, and that the hand back protocol was to run next.

16:55:54.562 [Debug] ... 6 [SSLSocket::shutdown] SSL shutdown close notify completed, waiting for peer close notify 
16:55:54.563 [Debug] ... 6 [SSLSocket::shutdown] SSL shutdown received peer close notify, shutdown complete 
16:55:54.563 [Debug] ... 6 [TransportShutdownTask::runTask] Transport shutdown completed successfully. 
16:55:54.563 [Debug] ... 6 [TransportShutdownProtocol::asyncIoTaskComplete] Handling successful task completion event. 
16:55:54.563 [Debug] ... 6 [TransportShutdownProtocol::asyncIoTaskComplete] Transport shutdown completed successfully. 
16:55:54.563 [Debug] ... 4 [TransportShutdownProtocol::setState] Transitioning state from TRANSPORT_SHUTDOWN_PROTOCOL_HANDSHAKE to TRANSPORT_SHUTDOWN_PROTOCOL_COMPLETE 
16:55:54.563 [Debug] ... 4 [ProxyConnection::handleTransportShutdownCompletion] Transport shutdown protocol completed successfully. 
16:55:54.563 [Debug] ... 4 [ProxyConnection::handleTransportShutdownCompletion] Running hand back protocol next. 


The above sequence is important.  An SSL shutdown is attempted at the end of every secure comms connection, and may log a similar error depending on the timing of events on the network and in the processes.  In the situations of interest, however, the SSL shutdown is happening during connection establishment, not at the time of disconnect.  The need for the hand back protocol indicates that the connection is intended to remain in use.

Notice that some of the log messages above appear only at DebugLevel=4 or 6.

When the nbpxyhelper log is not available, the debug logs from the connecting process and the accepting service can be used to confirm the situation.

Example 2 - bpbrm to bpdbm to update the backup image metadata (Windows)

The bpbrm debug log shows that, during connection establishment, there was a premature close of the connection to bpdbm.  Note both that the connection was no longer being secured by the proxy processes (a bpdbm-auth-only connection), and that it was intended to be used to update the files list for the backup image (db_FLISTsend).  The result is status 233 for the backup job.

00:04:12.562 [20228.8140] <2> logconnections: BPDBM CONNECT FROM 10.10.20.31.50483 TO 10.10.20.30.1556 fd = 1020
00:04:12.562 [20228.8140] <2> db_begin: auth only query 78, socket 1020 is not proxied
00:04:12.562 [20228.8140] <2> get_exactly_n_bytes_or_eof_abs: premature EOF found
00:04:12.562 [20228.8140] <2> ts_get_long_abs: error reading long from socket: The process cannot access the file because it is being used by another process. (32)
00:04:12.562 [20228.8140] <2> ts_get_adaptable_string: error reading long from socket: The process cannot access the file because it is being used by another process.  (32)
00:04:12.562 [20228.8140] <2> db_getdata: ts_get_string_handle() failed: The process cannot access the file because it is being used by another process.  (32) premature end of file encountered (-9) WSAGetLastError(): 0
00:04:12.562 [20228.8140] <2> db_startrequest: protocol error 1
00:04:12.562 [20228.8140] <16> db_begin: db_startrequest() failed: premature eof encountered
00:04:12.562 [20228.8140] <2> db_FLISTsend: db_begin() failed: premature eof encountered
00:04:12.562 [20228.8140] <16> non_mpx_backup_archive_verify_import: db_FLISTsend failed: premature eof encountered (233)

The bpdbm debug log for the other end of the same connection shows that it had not yet completed the hand back protocol, and that on Windows the vnet_proxy_handoff_receive API returned Winsock error 10107 (WSASYSCALLFAILURE, system call failure) because nbpxyhelper closed the connection without sending the expected socket information.  This results in a secure proxy protocol failure.

00:04:12.719 [17856.24872] <2> DbmParentInChild::handle_input: Successfully received fd(1012) from parent(BPDBM ACCEPT FROM 10.10.20.31.50483 TO 10.136.21.130.1556 fd = 908), request count(5)
00:04:12.719 [17856.24872] <2> daemon_proxy_proto: Preparing to do daemon protocol for (10.10.20.30:1556 <- 10.10.20.31:50483)
00:04:12.734 [17856.24872] <16> vnet_proxy_handoff_receive: WSASocket() failed: 10107
00:04:12.734 [17856.24872] <16> vnet_proxy_protocol_from_legacy_ex: vnet_proxy_handoff_receive() failed: 9
00:04:12.734 [17856.24872] <16> vnet_proxy_socket_swap: vnet_proxy_protocol_from_legacy() failed: 9
00:04:12.734 [17856.24872] <16> daemon_proxy_proto: vnet_proxy_socket_swap() failed: vnet status 9, nb status 23
00:04:12.734 [17856.24872] <16> process_request: Secure proxy protocol failed
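
For background, a socket hand-off between Windows processes is normally done by serializing the socket with WSADuplicateSocket(), shipping the resulting WSAPROTOCOL_INFO structure over IPC, and rebuilding the socket with WSASocket().  The sketch below shows only the receiving half and is an assumption about the mechanism vnet_proxy_handoff_receive wraps; WSASocket() fails, as with the 10107 above, when that structure never arrives intact.

#include <winsock2.h>

/* Rebuild a socket from a WSAPROTOCOL_INFOW received over IPC.
 * FROM_PROTOCOL_INFO tells WSASocket() to take the address family,
 * type, and protocol from the duplicated descriptor itself. */
SOCKET rebuild_socket(WSAPROTOCOL_INFOW *info)
{
    return WSASocketW(FROM_PROTOCOL_INFO, FROM_PROTOCOL_INFO,
                      FROM_PROTOCOL_INFO, info, 0, 0);
}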

Example 3 - bpbrm to bpdbm to update the backup image metadata (UNIX/Linux)

The bpbrm debug log shows the same failure sequence as the Windows example above.

21:27:08.766 [200327.200327] <2> db_begin: auth only query 78, socket 19 is not proxied
21:27:08.766 [201136.201136] <2> vnet_pbxConnect_ex: pbxConnectExEx Succeeded
21:27:08.766 [200327.200327] <2> get_exactly_n_bytes_or_eof_abs: premature EOF found
21:27:08.766 [200327.200327] <2> ts_get_long_abs: error reading long from socket: Broken pipe (32)
21:27:08.766 [200327.200327] <2> ts_get_adaptable_string: error reading long from socket: Broken pipe (32)
21:27:08.766 [200327.200327] <2> db_getdata: ts_get_string_handle() failed: Broken pipe (32) premature end of file encountered (-9)
21:27:08.766 [200327.200327] <2> db_startrequest: protocol error 1
21:27:08.766 [200327.200327] <16> db_begin: db_startrequest() failed: premature eof encountered
21:27:08.766 [200327.200327] <2> db_FLISTsend: db_begin() failed: premature eof encountered
21:27:08.766 [200327.200327] <16> non_mpx_backup_archive_verify_import: db_FLISTsend failed: premature eof encountered (233)
...snip...
21:27:14.617 [200327.200327] <2> bpbrm Exit: client backup EXIT STATUS 233: premature eof encountered

The bpdbm debug log shows the same failure sequence as the Windows example above, except that on UNIX/Linux the vnet_proxy_handoff_receive API doesn't have access to the Windows Sockets error numbers, so it logs a slightly different message.

21:27:08.712 [18093] <2> DbmParentInChild::handle_input: Successfully received fd(9) from parent(BPDBM ACCEPT FROM 10.20.30.40.47967 TO 10.40.30.20.1556 fd = 30), request count(191)
21:27:08.712 [18093] <2> daemon_proxy_proto: Preparing to do daemon protocol for (10.40.30.20:1556 <- 10.20.30.40:47967)
21:27:08.764 [18093] <16> vnet_proxy_handoff_receive: recvmsg() msg_iov = 0x00 0x00
21:27:08.765 [18093] <16> vnet_proxy_protocol_from_legacy_ex: vnet_proxy_handoff_receive() failed: 9
21:27:08.765 [18093] <16> vnet_proxy_socket_swap: vnet_proxy_protocol_from_legacy() failed: 9
21:27:08.765 [18093] <16> daemon_proxy_proto: vnet_proxy_socket_swap() failed: vnet status 9, nb status 23
21:27:08.765 [18093] <16> process_request: Secure proxy protocol failed
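
The recvmsg() message suggests the UNIX/Linux hand-off uses standard descriptor passing over a UNIX-domain socket (SCM_RIGHTS ancillary data).  The sketch below illustrates that receive path under this assumption, not the actual NetBackup code; recvmsg() returning 0 corresponds to the proxy closing the connection before sending the descriptor.

#include <string.h>
#include <sys/socket.h>
#include <sys/uio.h>

/* Receive one file descriptor over a connected UNIX-domain socket.
 * Returns the descriptor, or -1 if the peer closed first or sent no
 * SCM_RIGHTS ancillary data. */
int recv_fd(int unix_sock)
{
    char byte;                               /* one byte of real payload */
    struct iovec iov = { &byte, 1 };
    union {
        struct cmsghdr hdr;
        char buf[CMSG_SPACE(sizeof(int))];
    } ctl;
    struct msghdr msg;
    struct cmsghdr *cmsg;
    int fd;

    memset(&msg, 0, sizeof(msg));
    msg.msg_iov = &iov;
    msg.msg_iovlen = 1;
    msg.msg_control = ctl.buf;
    msg.msg_controllen = sizeof(ctl.buf);

    if (recvmsg(unix_sock, &msg, 0) <= 0)
        return -1;                           /* peer closed early */

    cmsg = CMSG_FIRSTHDR(&msg);
    if (cmsg == NULL || cmsg->cmsg_level != SOL_SOCKET ||
        cmsg->cmsg_type != SCM_RIGHTS)
        return -1;                           /* no descriptor attached */

    memcpy(&fd, CMSG_DATA(cmsg), sizeof(fd));
    return fd;
}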

Example 4 - CLI (running as non-root) failure when NBAC is configured

This is the bplist debug log, showing that a connection to the bprd service was partially successful using NBAC and the secure comm proxy.  After the hand back protocol, however, the next read from the connection encountered VN_STATUS_READ_FAILED (9) because the connection had been dropped by a process on the remote end.  As a result, the CLI failed with status 23.

16:55:54.493 [48946] <4> main: real locales <C>
16:55:54.493 [48946] <4> main: standardized locales - lc_messages <C> lc_ctype <C> lc_time <C> lc_collate <C> lc_numeric <C>
16:55:54.493 [48946] <2> logparams: -C client01 -t 0 -X -s 1592092800 -R /d/client01/d1/*
16:55:54.493 [48946] <2> process_args: /d/client01/d1/*
16:55:54.503 [48946] <2> vnet_pbxConnect_ex: pbxConnectExEx Succeeded
16:55:54.565 [48946] <2> logconnections: BPRD CONNECT FROM 10.10.10.10.62466 TO 10.10.10.10.1556 fd = 4
16:55:54.568 [48946] <16> EAT_LOG: (../vrtsatflock.c,158)Opening lock file failed. Errno : 13
16:55:54.569 [48946] <8> vnet_pop_byte: [vnet.c:227] errno 0 0x0
16:55:54.569 [48946] <2> vnet_pop_byte: vnet.c.229: 0: Function failed: 9 0x00000009
16:55:54.569 [48946] <2> vnet_pop_string: vnet.c.309: 0: Function failed: 9 0x00000009
16:55:54.569 [48946] <2> vnet_pop_signed: vnet.c.352: 0: Function failed: 9 0x00000009
16:55:54.569 [48946] <8> vnet_check_vxss_client_magic_with_info_ex: [vnet_vxss_helper.c:1274] vnet_pop_signed() failed 9 0x9
16:55:54.569 [48946] <8> vnet_check_vxss_client_magic_with_info: [vnet_vxss_helper.c:941] vnet_check_vxss_client_magic_with_info failed 9 0x9
16:55:54.569 [48946] <2> bprd_connect_need_auth: vnet_check_vxss_client_magic_with_info(server01.com) failed: 9
16:55:54.569 [48946] <16> send_request: Can't connect to host master01.com: socket read failed (23)
16:55:54.569 [48946] <4> bplist:exit(): EXIT STATUS 23: socket read failed

The bprd debug log shows the listener PID accepting the connection, but the child PID handling the vnet_proxy_socket_swap API encountered the same vnet_proxy_handoff_receive error as bpdbm in the examples above.  This is all visible even with VERBOSE=0 configured.

00:00:01.239 [7574.7574] <2> listen_loop: CONTINUING...good morning bprd (VERBOSE = 0): NetBackup 8.2 2019062501 on master01.com
...snip...
16:55:54.503 [7574.7574] <2> vnet_pbxAcceptSocket_ex: Accepted sock[19] from 40.30.20.10:62466
...snip...
16:55:54.504 [48949.48949] <2> daemon_proxy_proto: Preparing to do daemon protocol for (10.10.10.10:1556 <- 10.10.10.10:62466)
16:55:54.564 [48949.48949] <16> vnet_proxy_handoff_receive: recvmsg() msg_iov = 0x00 0x00
16:55:54.564 [48949.48949] <16> vnet_proxy_protocol_from_legacy_ex: vnet_proxy_handoff_receive() failed: 9
16:55:54.564 [48949.48949] <16> vnet_proxy_socket_swap: vnet_proxy_protocol_from_legacy() failed: 9
16:55:54.564 [48949.48949] <16> daemon_proxy_proto: vnet_proxy_socket_swap() failed: vnet status 9, nb status 23
16:55:54.564 [48949.48949] <32> bprd: secure proxy protocol failed
17:02:00.957 [7574.7574] <2> vnet_pbxAcceptSocket_ex: Accepted sock[19] from 40.30.20.10:59601

Cause

The problem is due to an SSL error condition that occurred earlier, during another operation in nbpxyhelper.  The SSL error queue is cleared regularly, but was not being cleared before the SSL shutdown close notify processing.  The stale entry caused the unexpected 'shutdown while in init' condition, and the socket was not handed back to the local process.

The nbpxyhelper on the other end of the connection could potentially encounter a similar condition, but typically does not.  As a result, its socket is handed back to the local process, which then encounters an error because the connection has already been closed by the remote nbpxyhelper.

Increasing the debug level for nbpxyhelper appears to solve the problem on some hosts, but only because the extra logging slows the thread slightly, giving other processing a chance to clear the error queue before the SSL shutdown close notify is processed.
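
The generic OpenSSL hygiene that avoids this class of problem is to drain the error queue before starting the close notify exchange, so that SSL_shutdown() is judged only on its own result.  The sketch below illustrates that idiom under the cause described above; it is not the actual NetBackup change.

#include <openssl/err.h>
#include <openssl/ssl.h>

/* Clear stale error-queue entries from earlier operations before each
 * SSL_shutdown() call, so a leftover error cannot be misread as a
 * shutdown failure. */
int tls_shutdown_defensive(SSL *ssl)
{
    ERR_clear_error();
    int rc = SSL_shutdown(ssl);       /* send our close notify */
    if (rc == 0) {
        ERR_clear_error();
        rc = SSL_shutdown(ssl);       /* wait for the peer close notify */
    }
    return rc;
}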

Solution

This problem is resolved in NetBackup 8.3.

There are no plans to address this issue by way of a patch or hotfix in earlier versions of the software at the present time.  However, the issue has been addressed in the revision of the product specified in this article.  Please contact your Veritas Sales representative or the Veritas Sales group for upgrade information including upgrade eligibility to the release containing the resolution for this issue.


References

Etrack: 4007237, Etrack: 4006883, Etrack: 4007318, Etrack: 4004942
