Problem
When there are a large number of network interfaces on the host, and the Preferred Network configuration prohibits at least one of the local interfaces from being used, it is observed that there are significant delays connecting to some hosts.
These delays most commonly occur between when bpbrm is started and when it is connected to the client. Once connected, the client and bptm will complete the data transfer normally, sometime in only a few seconds for small jobs. The connection delays are more noticeable with small jobs because they account for a larger percentage of the elapse time for the job. Large backups are equally affected, but may not be noticed.
This problem may affect CORBA connections, but the effect is less because the delay can occur only once per source and destination process due to the ConnectList cache, and the delay is shorter.
These delays will also affect restore, and other operations, between the same hosts.
Error Message
There are no error messages, but the Job Details show the delays after the job becomes active. There will be two or three spots, depending on job type, where there will be delays of 20 or more seconds. In this example, the data transfer takes only 2 seconds out of the total elapse time of 287 seconds.
...snip...
12:05:10 - Info nbjm (pid=28274) started backup (backupid=myclient_1393869869) job ...
…delay…
12:06:10 - Info bpbrm (pid=3429) myclient is the host to backup data from
12:06:10 - Info bpbrm (pid=3429) reading file list for client
…delay…
12:07:41 - connecting
12:07:41 - Info bpbrm (pid=3429) listening for client connection
…delay…
12:09:43 - Info bpbrm (pid=3429) accepted connection from client
12:09:43 - Info dbclient (pid=25892) Backup started
12:09:43 - Info bpbrm (pid=3429) bptm pid: 18775
12:09:43 - Info bptm (pid=6553) start
12:09:43 - connected; connect time: 0:02:02
...snip...
12:09:52 - Info dbclient (pid=25892) dbclient(pid=10427) wrote first buffer(size=262144)
12:09:54 - Info dbclient (pid=25892) dbclient waited 346 times for empty buffer, delayed 445 times
12:09:54 - Info dbclient (pid=25892) done. status: 0
12:09:54 - Info bptm (pid=6553) waited for full buffer 206 times, delayed 5672 times
12:09:57 - Info bptm (pid=6553) EXITING with status 0 <----------
The bpbrm debug log shows there are no delays connecting to the master server. Notice that there are preferred network settings configured. At least one of the local interfaces is prohibited, and there is not a match for the destination IP address. If the master server was on the 2.2.2.0/24 network, a 10 second delay would be seen here.
...snip...
12:05:10.504 [3429] <2> verify_client: ... *connect_opts: 16974338 16974338 0x01030202
12:05:10.504 [3429] <2> ConnectionCache::connectAndCache: Acquiring new connection for host mymaster, query type 223
12:05:10.504 [3429] <2> vnet_get_pref_netconnection: ... targaddr=1.1.1.1 No match, some prohibited, prefnet=IF_LIST
12:05:10.506 [3429] <2> logconnections: BPDBM CONNECT FROM 1.1.1.2.46242 TO 1.1.1.1.1556 fd = 4
However, when connecting to bpcd on the client host there is 3 delays of 10 seconds each. The delays come between each of 4 attempts to connect to the host. The fourth is successful to the PBX port so this is not a problem with blocked ports in the firewall or service not listening. Notice that the preferred network settings also do not match on this target IP address.
12:05:10.554 [3429] <2> bpbrm start_bpcd_stat: ... bpcd_client: myclient
...snip...
12:05:10.556 [3429] <2> vnet_get_pref_netconnection: ... targaddr=4.4.4.3 No match, some prohibited, prefnet=IF_LIST
12:05:10.556 [3429] <2> async_connect: ... connect in progress 1 0x1
12:05:20.566 [3429] <2> async_connect: ... connect in progress 2 0x2
12:05:30.576 [3429] <2> async_connect: ... connect in progress 3 0x3
12:05:40.586 [3429] <2> async_connect: ... connect in progress 4 0x4
12:05:40.588 [3429] <2> logconnections: BPCD CONNECT FROM 4.4.4.2.34346 TO 4.4.4.3.1556 fd = 8
The same sequence of delays occurs when connecting to vnetd for the forward socket that bpcd will also use.
12:05:40.589 [3429] <2> vnet_get_pref_netconnection: ... targaddr=4.4.4.3 No match, some prohibited, prefnet=IF_LIST
12:05:40.590 [3429] <2> async_connect: ... connect in progress 1 0x1
12:05:50.784 [3429] <2> async_connect: ... connect in progress 2 0x2
12:06:00.785 [3429] <2> async_connect: ... connect in progress 3 0x3
12:06:10.795 [3429] <2> async_connect: ... connect in progress 4 0x4
12:06:10.797 [3429] <2> async_connect: ... connect async CONNECT FROM 4.4.4.2.44451 TO 4.4.4.3.1556 fd = 9
12:06:10.798 [3429] <8> vnet_vnetd_connect_forward_socket_begin: ... VN_REQUEST_CONNECT_FORWARD_SOCKET 10 0xa
This first bpcd connection was to verify several settings on the client.
12:06:10.930 [3429] <2> bpcr_get_hostname_rqst: Server hostname length = 12
12:06:10.930 [3429] <2> bpcr_get_version_rqst: bpcd version: 07500004
12:06:10.937 [3429] <2> verify_client: ... bpcr_update_resiliency : 0 0 0x00000000
12:06:10.938 [3429] <2> verify_client: ... *connect_opts: 16843008 16843008 0x01010100
...snip...
The next pair of connections to bpcd and vnetd incurs the same 10 second delays.
12:06:10.946 [3429] <2> async_connect: ... connect in progress 1 0x1
12:06:20.956 [3429] <2> async_connect: ... connect in progress 2 0x2
12:06:30.963 [3429] <2> async_connect: ... connect in progress 3 0x3
12:06:40.973 [3429] <2> async_connect: ... connect in progress 4 0x4
12:06:40.974 [3429] <2> logconnections: BPCD CONNECT FROM 4.4.4.2.36029 TO 4.4.4.3.1556 fd = 9
12:06:40.975 [3429] <2> async_connect: ... connect in progress 1 0x1
12:06:50.976 [3429] <2> async_connect: ... connect in progress 2 0x2
12:07:00.984 [3429] <2> async_connect: ... connect in progress 3 0x3
12:07:10.994 [3429] <2> async_connect: ... connect in progress 4 0x4
12:07:10.996 [3429] <8> do_pbx_service: ... via PBX VNETD CONNECT FROM 4.4.4.2.55183 TO 4.4.4.3.1556 fd = 10
12:07:11.115 [3429] <2> bpcr_get_version_rqst: bpcd version: 07500004
12:07:11.115 [3429] <2> bpcr_get_charset_rqst: bpcd charset: 0000000d
...snip...
Because this is an Oracle stream-based application backup using XBSA, there are three vnetd forward sockets created to the client to be used as the New Data, Data, and Name connections. Each incurs the same delays. This would also be true of DB2, Informix, SAP, SQL-Server, Sybase, and DataStore backup jobs.
12:07:11.160 [3429] <2> async_connect: ... connect in progress 1 0x1
12:07:21.166 [3429] <2> async_connect: ... connect in progress 2 0x2
12:07:31.176 [3429] <2> async_connect: ... connect in progress 3 0x3
12:07:41.186 [3429] <2> async_connect: ... connect in progress 4 0x4
12:07:41.188 [3429] <8> do_pbx_service: ... via PBX VNETD CONNECT FROM 4.4.4.2.54261 TO 4.4.4.3.1556 fd = 9
...snip...
12:07:41.267 [3429] <2> set_job_details: Tfile (271461): LOG 1393870061 4 bpbrm 3429 listening for client connection
12:07:41.267 [3429] <2> bpbrm listen_for_client: HOT_ORACLE_DB_BACKUP
...snip...
12:07:41.268 [3429] <2> async_connect: ... connect in progress 1 0x1
12:07:51.270 [3429] <2> async_connect: ... connect in progress 2 0x2
12:08:01.278 [3429] <2> async_connect: ... connect in progress 3 0x3
12:08:11.288 [3429] <2> async_connect: ... connect in progress 4 0x4
12:08:11.289 [3429] <8> do_pbx_service: ... via PBX VNETD CONNECT FROM 4.4.4.2.46340 TO 4.4.4.3.1556 fd = 10
12:08:11.366 [3429] <2> bpbrm listen_for_client: ... accept_socket_data: 10 10 0x0000000a
...snip...
12:08:11.368 [3429] <2> async_connect: ... connect in progress 1 0x1
12:08:21.378 [3429] <2> async_connect: ... connect in progress 2 0x2
12:08:31.388 [3429] <2> async_connect: ... connect in progress 3 0x3
12:08:41.390 [3429] <2> async_connect: ... connect in progress 4 0x4
12:08:41.392 [3429] <8> do_pbx_service: ... via PBX VNETD CONNECT FROM 4.4.4.2.40519 TO 4.4.4.3.1556 fd = 11
12:08:41.471 [3429] <2> bpbrm listen_for_client: ... accept_socket_name: 11 11 0x0000000b
...snip...
The same delays are present when making the connections, to bpcd and vnetd, needed to update the comm file. Those updates make the client aware of the prior connections and the rest of the job attributes.
12:08:41.473 [3429] <2> async_connect: ... connect in progress 1 0x1
12:08:51.483 [3429] <2> async_connect: ... connect in progress 2 0x2
12:09:01.490 [3429] <2> async_connect: ... connect in progress 3 0x3
12:09:11.500 [3429] <2> async_connect: ... connect in progress 4 0x4
12:09:11.501 [3429] <2> logconnections: BPCD CONNECT FROM 4.4.4.2.59704 TO 4.4.4.3.1556 fd = 12
...snip...
12:09:11.503 [3429] <2> async_connect: ... connect in progress 1 0x1
12:09:21.513 [3429] <2> async_connect: ... connect in progress 2 0x2
12:09:31.524 [3429] <2> async_connect: ... connect in progress 3 0x3
12:09:41.534 [3429] <2> async_connect: ... connect in progress 4 0x4
12:09:41.535 [3429] <8> do_pbx_service: ... via PBX VNETD CONNECT FROM 4.4.4.2.54019 TO 4.4.4.3.1556 fd = 13
...snip...
12:09:43.001 [3429] <2> bpbrm write_msg_to_progress_file: INF - Data socket = mymm-bk.IPC:/tmp/vnet-...
12:09:43.005 [3429] <2> bpbrm write_msg_to_progress_file: INF - Name socket = mymm-bk.IPC:/tmp/vnet-...
12:09:43.006 [3429] <2> bpbrm write_msg_to_progress_file: INF - Job id = 271461
12:09:43.006 [3429] <2> bpbrm write_msg_to_progress_file: INF - Backup id = myclient_1393869869
12:09:43.010 [3429] <2> bpbrm write_msg_to_progress_file: INF - New data socket = mymm-bk.IPC:/tmp/vnet-...
12:09:43.010 [3429] <2> bpbrm write_msg_to_progress_file: INF - Encrypt = 0
12:09:43.011 [3429] <2> bpbrm write_msg_to_progress_file: INF - Use FAT shared memory
12:09:43.011 [3429] <2> bpbrm write_msg_to_progress_file: INF - Compression = 0
…snip…
Once the dbclient is connected to the Data, Name, and New Data connections, it sends the BACKUP START and bpbrm then starts bptm. Thereafter the data transfer is completed in a few seconds.
12:09:43.210 [3429] <2> set_job_details: Tfile (271461): LOG 1393870183 4 bpbrm 3429 accepted connection from client
12:09:43.250 [3429] <2> bpbrm main: from client myclient: read client start message
12:09:43.250 [3429] <2> bpbrm spawn_child: /usr/openv/netbackup/bin/bptm ...
12:09:43.251 [3429] <2> set_job_details: Tfile (271461): LOG 1393870183 4 bpbrm 3429 bptm pid: 6553
...snip...
12:09:46.994 [3429] <2> bpbrm mm_sig: received ready signal from media manager
12:09:52.265 [3429] <2> bpbrm main: from client myclient: change timeout to 10800
12:09:52.357 [3429] <2> set_job_details: ... dbclient(pid=25892) wrote first buffer(size=262144)
12:09:54.167 [3429] <2> bpbrm main: ADDED FILES TO DB FOR myclient_1393869869 1 /arc.DB.s5299.p1.t841233867
12:09:54.167 [3429] <2> set_job_details: ... dbclient waited 3 times for empty buffer, delayed 46 times
12:09:54.167 [3429] <2> bpbrm main: client myclient EXIT STATUS = 0: the requested operation was successfully completed
...snip...
12:09:54.167 [3429] <2> ConnectionCache::connectAndCache: Acquiring new connection for host mymaster, query type 78
12:09:54.170 [3429] <2> logconnections: BPDBM CONNECT FROM 1.1.1.2.60701 TO 1.1.1.1.1556 fd = 5
...snip...
12:09:54.236 [3429] <2> bpbrm wait_for_child: start
12:09:57.750 [3429] <2> bpbrm wait_for_child: child exit_status = 0 signal_status = 0
12:09:57.750 [3429] <2> bpbrm wait_for_child: child exit normal
12:09:57.750 [3429] <2> bpbrm main: validating image for client myclient
12:09:57.750 [3429] <2> ConnectionCache::connectAndCache: Acquiring new connection for host mymaster, query type 74
12:09:57.752 [3429] <2> logconnections: BPDBM CONNECT FROM 1.1.1.2.57337 TO 1.1.1.1.1556 fd = 5
...snip...
12:09:57.837 [3429] <2> bpbrm Exit: client backup EXIT STATUS 0: the requested operation was successfully completed
The bptm debug log does not show any delays in this example, because it connects only to the master server and not to the client host. It confirms that the data transfer took only a few seconds. If the master was on the 2.2.2.0/24 network, then bptm would have delays making both legacy and CORBA connections, 10 and 5 seconds respectively.
12:09:43.679 [6553] <2> bptm: INITIATING (VERBOSE = 5): -w -c myclient ...
...snip...
12:09:44.392 [6553] <2> io_init: child delay = 1, parent delay = 1 (milliseconds)
...snip...
12:09:46.994 [6553] <4> write_backup: begin writing backup id myclient_1393869869, ...
...snip...
12:09:52.359 [6553] <2> write_data: received first buffer (262144 bytes), begin writing data
...snip...
12:09:54.176 [6553] <2> write_data: writing short block, 32768 bytes, remainder 0
12:09:54.176 [6553] <2> write_data: waited for full buffer 206 times, delayed 5672 times
The sequence of events, shown by bpbrm above, can also be seen in the bptestbpcd output on the media server.
mymm$ bptestbpcd -client myclient -debug
Cause
The delays are due to the number of available network interfaces on the host, combined with having NetBackup configured to not use at least one of the interfaces. By prohibiting an interface, the operating system (OS) cannot be allowed to pick ANY source binding and instead NetBackup must try to make each outbound connection from each of the remaining local interfaces until one is successful.
In this example, the master server is network routable out the first interface (1.1.1.0/24) so there are no delays connecting to that host. But the client is not network routable via the first three interfaces (1.1.1.0/24, 2.2.2.0/24, or 3.3.3.0/24). For legacy connections this results in a 10 second delay per non-routable interface (30 seconds total), multiplied by the number of legacy connections (9) that need to be established to the client. As a result there are 270 seconds of connection delay to perform a data transfer that takes less than 10 seconds.
The delay for outbound CORBA connections is 5 seconds per non-routable source interface when connecting to the destination, multiplied by the number of unique processes on the remote host to which connections must be made.
The details are shown in the bptestnetconn connection (-c) output from bptestnetconn targeting the client name as the host (-H) and including the configured servers (-s) list. The connection (CN) delay shown is for a CORBA connection.
mymm$ bptestnetconn -v6 -a -p -t5 -o10 -c -s -Hmyclient
adding hostname = myclient
SERVER = mymaster
SERVER = mymm
...snip...
FL: myclient -> 4.4.4.3 : 1 ms FAST (< 5 sec) SRC: 1.1.1.2,2.2.2.2,3.3.3.2,4.4.4.2
FL: mymaster -> 1.1.1.1 : 1 ms FAST (< 5 sec) SRC: 1.1.1.2,2.2.2.2,3.3.3.2,4.4.4.2
FL: mymm -> 1.1.1.2 : 0 ms FAST (< 5 sec) [local] SRC: ANY
...snip...
CN: myclient : 15010 ms [SUCCESS] PBX: Yes VNETD: Yes BPCD: Yes
CN: mymaster : 3 ms [SUCCESS] PBX: Yes VNETD: Yes BPCD: Yes
CN: mymm : 2 ms [SUCCESS] PBX: Yes VNETD: Yes BPCD: Yes
CN: localhost : 3 ms [SUCCESS] PBX: Yes VNETD: Yes BPCD: Yes
...snip...
Displaying all locally configured IP addresses
and if these addresses are found in the DNS
Address Local DNS Entry
------- ----- ----------
192.168.1.1 Yes No - private/autoconfigured
127.0.0.1 Yes No - private/autoconfigured
5.5.5.2 Yes No - private/autoconfigured
4.4.4.2 Yes No - private/autoconfigured
3.3.3.2 Yes No - private/autoconfigured
2.2.2.2 Yes No - private/autoconfigured
1.1.1.2 Yes Yes
PREFERRED_NETWORK rules are applied in this order:
[0] PREFERRED_NETWORK = 5.5.5.2 PROHIBITED
[1] PREFERRED_NETWORK = 192.168.1.1 PROHIBITED
The delay for each outbound legacy connection attempt is 10 seconds to allow for TCP try and retry the connection on networks that drop a few packets. Depending on which local network interfaces are routable to the target host, the delay to each specific host may vary. Typically bpbrm will connect to bpcd at least twice when starting a backup or restore, resulting in 40 seconds of delay (bpcd, vnetd, bpcd, vnetd). The delays will be multiplied if there are additional non-routable interfaces.
Application backup jobs, such as the one above, make additional connections and incur additional delays.
Solution
These delays can be avoided by using different or additional Preferred Network configurations.
If there is a local interface that should not be used, replace the prohibited interface with a match setting for that network to specify the local source binding. Typically this will be the default route on the host, but could be another interface. For the situation above:
PREFERRED_NETWORK = 192.168.1.0/24 MATCH 1.1.1.2
PREFERRED_NETWORK = 5.5.5.0/24 MATCH 1.1.1.2
Results in the following source binding lists, which let the OS routing table pick the source interface binding:
FL: myclient -> 4.4.4.3 : 1 ms FAST (< 5 sec) SRC: ANY
FL: mymaster -> 1.1.1.1 : 1 ms FAST (< 5 sec) SRC: ANY
FL: mymm -> 1.1.1.2 : 0 ms FAST (< 5 sec) [local] SRC: ANY
Alternatively, match settings can be implemented for the specific locally attached networks, with a catch-all that uses the default route.
PREFERRED_NETWORK = 2.2.2.0/24 MATCH 2.2.2.2
PREFERRED_NETWORK = 3.3.3.0/24 MATCH 3.3.3.2
PREFERRED_NETWORK = 4.4.4.0/24 MATCH 4.4.4.2
PREFERRED_NETWORK = 0/0 MATCH 1.1.1.2
Resulting in the following source binding lists:
FL: myclient -> 4.4.4.3 : 1 ms FAST (< 5 sec) SRC: 4.4.4.2
FL: mymaster -> 1.1.1.1 : 1 ms FAST (< 5 sec) SRC: 1.1.1.2
FL: mymm -> 1.1.1.2 : 0 ms FAST (< 5 sec) [local] SRC: ANY
Applies To
NetBackup 7.x server and client hosts with multiple network interfaces