Status 0 on Application backups, but parent job fails with status 40 before or after

Article: 100019850
Last Published: 2014-02-24
Product(s): NetBackup & Alta Data Protection

Problem

The automatic job that runs the backup script for DB2, Informix, Oracle, SAP, or Sybase holds two network sockets open between nbjm on the master server and bpbrm on the media server.  These sockets sit idle, awaiting the script exit status, while the application backup jobs perform the actual backup.  If the TCP stacks on the hosts do not send TCP keepalives at a sufficiently frequent interval, a firewall or router between the hosts may drop the idle sockets, which can lead to one of three failure scenarios.

 

In the first scenario, the firewall drops the idle socket, the backup script then exits, and the TCP stack on the master server later attempts to send a TCP keepalive on the closed socket.  In this case, all of the application backup jobs complete successfully, but upon script exit bpbrm is unable to return the exit status to the master server.  When the keepalive is later attempted on the closed socket, nbjm fails the job with a status 40.

 

In the second scenario, the firewall drops the idle socket, the TCP stack on the master server attempts to send a TCP keepalive on the closed socket, and the backup script exits afterward.  In this case, nbjm receives an EPIPE from the O/S when the keepalive send fails and posts the status 40 to the automatic job.  The application backup jobs run unaffected and complete with status 0 later in the day.

 

In the third scenario, the idle socket timeout on the firewall is set to the exact same value as the TCP keepalive interval.  In this case, the first TCP keepalive is sent 2 hours after the socket was established.  But because the socket was used briefly at the start, the first keepalive is sent before the socket has been idle for 2 hours from the firewall's perspective, so it resets the idle socket timeout on the firewall.  At approximately 4 hours the socket will likely be dropped, unless the timing is again such that the keepalive arrives a few milliseconds before the idle socket timeout expires.  The result is the same as in the second scenario, except that the status 40 may not occur until the second or a later iteration of the TCP keepalive interval.

 

Note: In all three scenarios, the application backup jobs are not affected; they complete successfully and the images can be restored.  However, the automatic backup job fails with a status 40 and reruns the backup script, resulting in a redundant backup, possibly before the initial execution of the backup script has completed.

 

Note: The default TCP keepalive interval on most operating systems is 2 hours, but it can vary, so check the configuration.  In addition, the O/S does not drop the socket until any configured TCP keepalive retries have been exhausted, so the status 40 may not post until 0 to 10 minutes after the first unanswered TCP keepalive.
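As a rough worked example, the worst-case delay before the O/S reports the broken socket can be estimated from the keepalive tunables.  The values below are illustrative Linux-style defaults, not a statement about any particular host; check the actual settings on your platform.

```python
# Estimate when a status 40 would post after a firewall silently drops a socket.
# These are illustrative Linux-style defaults; verify the values on your hosts.
keepalive_time = 7200    # seconds of idle time before the first keepalive probe
keepalive_intvl = 75     # seconds between unanswered probes
keepalive_probes = 9     # unanswered probes before the socket is declared dead

# Worst case: the firewall drops the socket just after the last data was sent,
# so the full idle interval elapses before the first probe even goes out.
retry_window = keepalive_intvl * keepalive_probes
detect_seconds = keepalive_time + retry_window
print(detect_seconds)  # total seconds until the O/S reports the dead socket
print(retry_window)    # retry window after the first unanswered probe
```

With these example values the retry window alone is 675 seconds (about 11 minutes), which is why the status 40 can lag well behind the keepalive that first went unanswered.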

Error Message

40: network connection broken

 

Cause

The following evidence illustrates the first scenario, where the firewall is configured to drop idle sockets after 1 hour and the master server is configured to send TCP keepalives every 7200 seconds (2 hours).

 

The bphdb debug log shows the backup script took 01:28:08 to execute and exited with status 0.

 

10:43:08.678 [7325] <2> logparams: -sb -rdbms oracle ... -b myclient_1221446343 -jobid 249061 ...

...snip...

10:43:08.760 [7325] <4> bphdb do_backup: INF - Processing /mybackup.sh

...snip...

12:11:19.011 [7325] <4> bphdb Exit: INF - EXIT STATUS 0: the requested operation was successfully completed

 

The bpbrm debug log shows an elapsed time of 01:28:13 from its start to the receipt of exit status from bphdb.

 

10:43:07.155 [7310] <2> logparams: -backup ... -b myclient_1221446343 -jobid 249061 ...

...snip...

10:43:08.760 [7310] <2> write_file_names: buffering file name '/mybackup.sh' for output

...snipped waiting for the backup script to complete...

12:11:19.175 [7310] <2> logconnections: BPDBM CONNECT FROM 1.1.1.2.60354 TO 1.1.1.1.13724

...snip...

12:11:19.336 [7310] <2> bpbrm main: validating image for client myclient

...snip...

12:11:20.828 [7310] <2> bpbrm Exit: client backup EXIT STATUS 0: the requested operation was successfully completed

 

But the nbjm debug log does not show the job ending until a status 40 posts almost exactly 2 hours after bpbrm was started, which is more than 30 minutes after bphdb and bpbrm exited.

 

10:43:07.495 [Application] sending command bpbrm -backup ... -b myclient_1221446343 -jobid 249061 ...

...snip...

12:43:08.415 [Debug] [CallbackQueue::queueRequest] queueing JL updateJobStatus : jobid=249061, birthtime=1221446343, status=40 -- retry count=-1(CallbackQueue.cpp:1212)

 

A network trace confirmed that the master server TCP stack sent the TCP keepalive at the 7200 second interval, but did not receive a reply.

 

 

The following evidence illustrates the third scenario, where the first TCP keepalive was sent just a few milliseconds before the idle socket timeout expired, but the second did not arrive in time; 8 minutes of keepalive retries then preceded the status 40.

 

The bpbrm debug log confirms that the backup script exited after 10 hours, but that bpbrm was unable to send status to the master server because the socket had been closed.

 

12:21:05.883 [17930] <2> logparams: -backup -ct 4 -cl mypolicy -sched Full ...

...snip...

12:21:07.428 [17930] <2> write_file_names: buffering file name '/oracle_backup.sh' for output

...snip...

22:21:10.062 [17930] <2> put_strlen_str: cannot write data to network:  Broken pipe (32)

22:21:10.062 [17930] <16> bpbrm Exit: could not write EXIT STATUS to stderr

...snip...

22:21:10.477 [17930] <2> bpbrm Exit: client backup EXIT STATUS 0: the requested operation was successfully completed

 

The netstat output from the master server host shows that the firewall closed the sockets after approximately 248 minutes.

 

Tue Nov  2 12:20:34 EDT 2010

Tue Nov  2 12:21:04 EDT 2010

Tue Nov  2 12:21:33 EDT 2010

1.1.1.1.41026 2.2.2.2.13724 49640      0 49640      0 ESTABLISHED

1.1.1.1.41028 2.2.2.2.13724 49640      0 49640      0 ESTABLISHED

Tue Nov  2 12:22:02 EDT 2010

1.1.1.1.41026 2.2.2.2.13724 49640      0 49640      0 ESTABLISHED

1.1.1.1.41028 2.2.2.2.13724 49640      0 49640      0 ESTABLISHED

...snipped 501 occurrences of the same sockets and states...

Tue Nov  2 16:28:19 EDT 2010

1.1.1.1.41026 2.2.2.2.13724 49640      0 49640      0 ESTABLISHED

1.1.1.1.41028 2.2.2.2.13724 49640      0 49640      0 ESTABLISHED

Tue Nov  2 16:28:49 EDT 2010

1.1.1.1.41026 2.2.2.2.13724 49640      0 49640      0 ESTABLISHED

1.1.1.1.41028 2.2.2.2.13724 49640      0 49640      0 ESTABLISHED

Tue Nov  2 16:29:18 EDT 2010

...sockets are closed long before bpbrm has returned status...

Tue Nov  2 16:29:48 EDT 2010

Tue Nov  2 16:30:17 EDT 2010

 

The snoop output from the master server confirms that TCP keepalives were sent on both sockets after 2 hours and acknowledgements were received.  TCP keepalives were sent on both sockets again at 4 hours, but no responses were received.  After 8 minutes, the retry attempts ceased.

 

No.   Time        Source           Destination     Prot Info      

...snipped socket setup and initial data transfer.

  128 6.591679    1.1.1.1   2.2.2.2  TCP  41026 > vnetd [PSH, ACK] Seq=501 Ack=7 Win=49640 Len=12

  129 6.591961    2.2.2.2   1.1.1.1  TCP  vnetd > 41026 [ACK] Seq=7 Ack=513 Win=49640 Len=0

  130 6.677442    1.1.1.1   2.2.2.2  TCP  41028 > vnetd [ACK] Seq=12 Ack=109 Win=49640 Len=0

  151 7.061178    2.2.2.2   1.1.1.1  TCP  vnetd > 41028 [PSH, ACK] Seq=109 Ack=12 Win=49640 Len=24

  163 7.177495    1.1.1.1   2.2.2.2  TCP  41028 > vnetd [ACK] Seq=12 Ack=133 Win=49640 Len=0

  201 7.619933    2.2.2.2   1.1.1.1  TCP  vnetd > 41028 [PSH, ACK] Seq=133 Ack=12 Win=49640 Len=49

  212 7.737442    1.1.1.1   2.2.2.2  TCP  41028 > vnetd [ACK] Seq=12 Ack=182 Win=49640 Len=0

  227 8.074759    2.2.2.2   1.1.1.1  TCP  vnetd > 41028 [PSH, ACK] Seq=182 Ack=12 Win=49640 Len=48

  228 8.187444    1.1.1.1   2.2.2.2  TCP  41028 > vnetd [ACK] Seq=12 Ack=230 Win=49640 Len=0

52406 7206.285012 1.1.1.1   2.2.2.2  TCP  [TCP Keep-Alive] 41026 > vnetd [PSH, ACK] Seq=512 Ack=7 Win=49640 Len=1

52407 7206.285423 2.2.2.2   1.1.1.1  TCP  [TCP Keep-Alive ACK] vnetd > 41026 [ACK] Seq=7 Ack=513 Win=49640 Len=0

52408 7207.764962 1.1.1.1   2.2.2.2  TCP  [TCP Keep-Alive] 41028 > vnetd [PSH, ACK] Seq=11 Ack=230 Win=49640 Len=1

52409 7207.765626 2.2.2.2   1.1.1.1  TCP  [TCP Keep-Alive ACK] vnetd > 41028 [ACK] Seq=230 Ack=12 Win=49640 Len=0

71865 14405.982484 1.1.1.1  2.2.2.2  TCP  [TCP Keep-Alive] 41026 > vnetd [PSH, ACK] Seq=512 Ack=7 Win=49640 Len=1

71866 14407.462501 1.1.1.1  2.2.2.2  TCP  [TCP Keep-Alive] 41028 > vnetd [PSH, ACK] Seq=11 Ack=230 Win=49640 Len=1

71867 14410.042466 1.1.1.1  2.2.2.2  TCP  [TCP Keep-Alive] 41026 > vnetd [PSH, ACK] Seq=512 Ack=7 Win=49640 Len=1

71868 14412.292472 1.1.1.1  2.2.2.2  TCP  [TCP Keep-Alive] 41028 > vnetd [PSH, ACK] Seq=11 Ack=230 Win=49640 Len=1

71869 14418.152486 1.1.1.1  2.2.2.2  TCP  [TCP Keep-Alive] 41026 > vnetd [PSH, ACK] Seq=512 Ack=7 Win=49640 Len=1

71937 14421.932482 1.1.1.1  2.2.2.2  TCP  [TCP Keep-Alive] 41028 > vnetd [PSH, ACK] Seq=11 Ack=230 Win=49640 Len=1

71938 14434.362480 1.1.1.1  2.2.2.2  TCP  [TCP Keep-Alive] 41026 > vnetd [PSH, ACK] Seq=512 Ack=7 Win=49640 Len=1

72110 14441.192475 1.1.1.1  2.2.2.2  TCP  [TCP Keep-Alive] 41028 > vnetd [PSH, ACK] Seq=11 Ack=230 Win=49640 Len=1

72181 14466.762484 1.1.1.1  2.2.2.2  TCP  [TCP Keep-Alive] 41026 > vnetd [PSH, ACK] Seq=512 Ack=7 Win=49640 Len=1

72182 14479.692498 1.1.1.1  2.2.2.2  TCP  [TCP Keep-Alive] 41028 > vnetd [PSH, ACK] Seq=11 Ack=230 Win=49640 Len=1

72433 14526.782483 1.1.1.1  2.2.2.2  TCP  [TCP Keep-Alive] 41026 > vnetd [PSH, ACK] Seq=512 Ack=7 Win=49640 Len=1

72434 14539.702484 1.1.1.1  2.2.2.2  TCP  [TCP Keep-Alive] 41028 > vnetd [PSH, ACK] Seq=11 Ack=230 Win=49640 Len=1

72572 14586.796212 1.1.1.1  2.2.2.2  TCP  [TCP Keep-Alive] 41026 > vnetd [PSH, ACK] Seq=512 Ack=7 Win=49640 Len=1

72576 14599.722466 1.1.1.1  2.2.2.2  TCP  [TCP Keep-Alive] 41028 > vnetd [PSH, ACK] Seq=11 Ack=230 Win=49640 Len=1

72650 14646.812485 1.1.1.1  2.2.2.2  TCP  [TCP Keep-Alive] 41026 > vnetd [PSH, ACK] Seq=512 Ack=7 Win=49640 Len=1

72651 14659.742574 1.1.1.1  2.2.2.2  TCP  [TCP Keep-Alive] 41028 > vnetd [PSH, ACK] Seq=11 Ack=230 Win=49640 Len=1

72789 14706.832486 1.1.1.1  2.2.2.2  TCP  [TCP Keep-Alive] 41026 > vnetd [PSH, ACK] Seq=512 Ack=7 Win=49640 Len=1

72790 14719.762471 1.1.1.1  2.2.2.2  TCP  [TCP Keep-Alive] 41028 > vnetd [PSH, ACK] Seq=11 Ack=230 Win=49640 Len=1

72861 14766.852500 1.1.1.1  2.2.2.2  TCP  [TCP Keep-Alive] 41026 > vnetd [PSH, ACK] Seq=512 Ack=7 Win=49640 Len=1

72862 14779.782553 1.1.1.1  2.2.2.2  TCP  [TCP Keep-Alive] 41028 > vnetd [PSH, ACK] Seq=11 Ack=230 Win=49640 Len=1

73071 14826.872465 1.1.1.1  2.2.2.2  TCP  [TCP Keep-Alive] 41026 > vnetd [PSH, ACK] Seq=512 Ack=7 Win=49640 Len=1

73075 14839.802475 1.1.1.1  2.2.2.2  TCP  [TCP Keep-Alive] 41028 > vnetd [PSH, ACK] Seq=11 Ack=230 Win=49640 Len=1

 

The nbjm debug log confirms that nbjm began to terminate the job after the socket read failed because the keepalives went unanswered.

 

16:29:18.657 [Debug] PID:13418 TID:9 [reqid=-1287725941] 4 [BRMComm::terminateAllJobs] (101fb7e80) jobid=385103(../BRMComm.cpp:2008)

16:29:18.666 [Debug] PID:13418 TID:9 [No context] 4 [JobNotification::~JobNotification] (10213abc0) notify and close jobid=1385103 status=40(../BRMComm.cpp:2885)

 

Solution

The following options can be implemented, alone or in combination, to effect an appropriate solution.

 

1) Increase the idle socket timeout on the firewall to an interval that is long enough either to allow the backup script to complete or to allow TCP keepalives to be sent, i.e. greater than 2 hours.

 

2) Decrease the TCP keepalive interval on the master server host to a value that is less than one half of the idle socket timeout.  Use a network sniffer or similar software (e.g. Wireshark, snoop, tcpdump) to confirm that TCP keepalives are being transmitted at the expected intervals.  The following commands display the current settings.

 

AIX:

$ no -a | grep keep

 

HP-UX and Solaris:

$ ndd -get /dev/tcp tcp_keepalive_interval

 

Linux:

$ sysctl -a | grep keep
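On Linux, the same tunables can also be read directly from /proc, for example from a monitoring or verification script.  This is a minimal sketch that assumes the Linux /proc paths; on AIX, HP-UX, and Solaris use the commands above instead.

```python
import os

# Minimal sketch: read the Linux TCP keepalive tunables from /proc.
# Assumes Linux; these paths do not exist on AIX, HP-UX, or Solaris.
def read_keepalive_settings():
    base = "/proc/sys/net/ipv4"
    names = ("tcp_keepalive_time", "tcp_keepalive_intvl", "tcp_keepalive_probes")
    settings = {}
    for name in names:
        path = os.path.join(base, name)
        if os.path.exists(path):
            with open(path) as f:
                settings[name] = int(f.read().strip())  # values are in seconds (time/intvl) or a count (probes)
    return settings

print(read_keepalive_settings())
```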

 

3) If the idle socket timeout cannot be increased on the firewall, and the TCP keepalive interval cannot be decreased on the master server, but the TCP keepalive interval can be decreased on the media server, then SO_KEEPALIVE can be set on the media server end of the sockets so that the media server host sends the keepalives.  Be sure the interval is set to a value that is less than the idle socket timeout.
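For illustration only, this is how an application can enable keepalives on its end of a socket.  It is a generic sketch of the standard socket API, not NetBackup code, and the idle/interval/count values are placeholders.  The per-socket TCP_KEEPIDLE, TCP_KEEPINTVL, and TCP_KEEPCNT overrides are Linux names; on platforms that lack them, SO_KEEPALIVE still works but the timing comes from the system-wide settings.

```python
import socket

def enable_keepalive(sock, idle=600, interval=60, count=5):
    """Enable TCP keepalives on sock.  idle/interval/count are illustrative."""
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_KEEPALIVE, 1)
    # Per-socket timing overrides; these constants are Linux-specific, so
    # guard each one and fall back to the system-wide defaults elsewhere.
    if hasattr(socket, "TCP_KEEPIDLE"):
        sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPIDLE, idle)
    if hasattr(socket, "TCP_KEEPINTVL"):
        sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPINTVL, interval)
    if hasattr(socket, "TCP_KEEPCNT"):
        sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPCNT, count)
    return sock

sock = enable_keepalive(socket.socket(socket.AF_INET, socket.SOCK_STREAM))
```

Whatever mechanism sets the option, the effective idle time before the first probe must be less than the firewall's idle socket timeout for the keepalives to keep the connection open.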

 

