Intermittent status 83 on SAN client backup when pipe open is delayed until after pipe close

Article: 100012495
Last Published: 2014-07-01
Product(s): NetBackup & Alta Data Protection

Problem

Oracle RMAN application backup jobs using SAN Client over Fibre Transport (FT) fail intermittently with status 83.  The failure occurs after the job goes active and before transfer of the backup image begins.

 

The problem is more common when two or more channels are in use.  It typically does not affect the first application job to obtain FT resources, though in some situations it can.

 

The problem may affect other database backups and/or file system backups.

 

Error Message

The Job Details will show status 83 following a 'pipe open failed' error.  The job was active but never started transferring data.

 

1:25:46 - Error bptm (pid=8176) Could not open FT Client pipe for client oadb1: pipe open failed (13369361)

1:25:48 - Info bptm (pid=8176) EXITING with status 83

1:25:38 - Info bphdb (pid=28982) done. status: 83: media open error

 

 

The nbftsrvr debug log shows that pipe 3 was opened for the job (604955) and then unexpectedly closed 5 minutes later.

 

01:20:43.389 [Debug] NB 51216 nbftsrvr 199 ... [FTServerMgr_i::openPipe]  received for pipe 3 jobId 604955 shmId 832012302 owner 0 shmSize -1 MPX 0 bufferSegmentSize 131072 shmFlags 0 numBufferSegment 16

01:20:43.389 [Debug] NB 51216 nbftsrvr 199 ... [shmr_OpenPipe] shmr_OpenPipe id = 3 direction = 1 shmId = 832012302 shmSize = 4294967295 num = 16  bufferSize = 131072 mpx = 0 owner = 0 pid = 31195  shmCtrlId = -1 backupId = "myclient_1401499212" instance = 1401499212

...snip...

01:25:46.049 [Debug] NB 51216 nbftsrvr 199 ... [FTServerMgr_i::closePipe] pipe 3

 

 

The nbftclnt debug log shows that the pipe (3) for the job (604955) was closed at the same time, but that the pipe open request was, unexpectedly, not received until more than 10 minutes later.

 

01:25:46.676 [Debug] NB 51216 nbftclnt 200 ... [FTClientMgr_i::closePipe] jobid=604955, pipeId=3, on device SYMANTECFATPIPE 0.0 mymm returning 196626

...snip...

01:36:05.298 [Debug] NB 51216 nbftclnt 200 ... [PipeInit] FTClientOpenPipeErr_PIPE_NOT_OPEN

01:36:05.298 [Debug] NB 51216 nbftclnt 200 ... [OpenPipe] PipeInit failed with 196625: pPipe = 110a48870: pDevice = 1109190b0 "/dev/rmt2": pipeId 3: maxRetries 3: CorbanumBufSeg 16: CorbaBufSegSize 0: SingleThread 1: Synchronous 0

01:36:05.298 [Debug] NB 51216 nbftclnt 200 ... [~FTClientPipe]  Pipe ID 3 at 110a48870

01:36:05.298 [Debug] NB 51216 nbftclnt 200 ... [~FTClientPipe]  backup cnt 0 Pipe ID 3

01:36:05.298 [Debug] NB 51216 nbftclnt 200 ... [FTClientMgr_i::openPipe] Fatal error, failed to open the pipe for jobid=604955, on device SYMANTECFATPIPE 0.0 mymm, error = 196625

 

 

A TCP packet capture from a network device port adjacent to the master server shows many TCP connections established from the master server to the client host.  However, only two of them are used for GIOP, which transports the CORBA requests and replies between FSM and nbftclnt.  One connection shows an error-free and nearly continuous exchange of data as database and transaction log backups occur every few minutes.

 

However, the other connection shows an unexpected sequence.  After connection establishment there are two CORBA requests and replies.

 

Note: The clock on the capture host is not synced with the NetBackup hosts.

 

  1 3:09:41.66758  1.1.1.1  -> 3.3.3.3  TCP D=1556 S=35236 Syn Seq=3163902158 Len=0 ...

  2 3:09:41.66826  3.3.3.3 -> 1.1.1.1   TCP D=35236 S=1556 Syn Ack=3163902159 Seq=3197565071 Len=0 ...

  3 3:09:41.66852  1.1.1.1  -> 3.3.3.3  TCP D=1556 S=35236 Ack=3197565072 Seq=3163902159 Len=0 ...

  4 3:09:41.66878  1.1.1.1  -> 3.3.3.3  TCP D=1556 S=35236 Push Ack=3197565072 Seq=3163902159 Len=28 ...

                   > PBX service request, extension=nbFTClient

  5 3:09:41.66909  3.3.3.3 -> 1.1.1.1   TCP D=35236 S=1556 Ack=3163902159 Seq=3197565072 Len=0 ...

                   < TCP ACK+Window Update

  6 3:09:41.66970  3.3.3.3 -> 1.1.1.1   TCP D=35236 S=1556 Push Ack=3163902187 Seq=3197565072 Len=1 ...

                   < nbftclnt now owns connection

  7 3:09:41.66993  1.1.1.1  -> 3.3.3.3  TCP D=1556 S=35236 Ack=3197565073 Seq=3163902187 Len=0 ...

                   > TCP ACK (to TCP PUSH)

  8 3:09:41.67263  1.1.1.1  -> 3.3.3.3  TCP D=1556 S=35236 Push Ack=3197565073 Seq=3163902187 Len=153 ...

                   > GIOP Request id=1 _is_a

  9 3:09:41.67515  3.3.3.3 -> 1.1.1.1   TCP D=35236 S=1556 Push Ack=3163902340 Seq=3197565073 Len=25 ...

                   < GIOP Reply   id=1   No exception

 10 3:09:41.67853  1.1.1.1  -> 3.3.3.3  TCP D=1556 S=35236 Push Ack=3197565098 Seq=3163902340 Len=900 ...

                   > GIOP Request id=2 openPipe

 11 3:09:41.69277  3.3.3.3 -> 1.1.1.1   TCP D=35236 S=1556 Push Ack=3163903240 Seq=3197565098 Len=696 ...

                   < GIOP Reply   id=2   No exception

 12 3:09:41.73235  1.1.1.1  -> 3.3.3.3  TCP D=1556 S=35236 Ack=3197565794 Seq=3163903240 Len=0 ...

                   > TCP ACK (to TCP PUSH)

 

When the master server/FSM makes the third CORBA request, the TCP stack on the client host does not return a TCP ACK.  The TCP stack on the master server retransmits the packet containing the request many times without receiving a TCP ACK from the TCP stack on the client host, much less a response from nbftclnt.  NetBackup on the master server remains unaware of any problem until either the TCP stack returns a fault or the application timers expire.

 

 13 4:57:29.16138  1.1.1.1 -> 3.3.3.3  TCP D=1556 S=35236 Push Ack=3197565794 Seq=3163903240 Len=96 ...

                   > GIOP Request id=3 _non_existent

 14 4:57:29.36332  1.1.1.1 -> 3.3.3.3  TCP D=1556 S=35236 Push Ack=3197565794 Seq=3163903240 Len=96 ...

                   > TCP retransmission with TCP PUSH

 15 4:57:29.76724  1.1.1.1 -> 3.3.3.3  TCP D=1556 S=35236 Push Ack=3197565794 Seq=3163903240 Len=96 ...

                   > TCP retransmission with TCP PUSH repeats 14 times...

 16 4:57:30.57508  1.1.1.1 -> 3.3.3.3  TCP D=1556 S=35236 Push Ack=3197565794 Seq=3163903240 Len=96 ...

 17 4:57:32.19106  1.1.1.1 -> 3.3.3.3  TCP D=1556 S=35236 Push Ack=3197565794 Seq=3163903240 Len=96 ...

 18 4:57:35.42339  1.1.1.1 -> 3.3.3.3  TCP D=1556 S=35236 Push Ack=3197565794 Seq=3163903240 Len=96 ...

 19 4:57:41.88752  1.1.1.1 -> 3.3.3.3  TCP D=1556 S=35236 Push Ack=3197565794 Seq=3163903240 Len=96 ...

 20 4:57:54.81578  1.1.1.1 -> 3.3.3.3  TCP D=1556 S=35236 Push Ack=3197565794 Seq=3163903240 Len=96 ...

 21 4:58:20.67228  1.1.1.1 -> 3.3.3.3  TCP D=1556 S=35236 Push Ack=3197565794 Seq=3163903240 Len=96 ...

 22 4:59:12.38500  1.1.1.1 -> 3.3.3.3  TCP D=1556 S=35236 Push Ack=3197565794 Seq=3163903240 Len=96 ...

 23 5:00:55.81132  1.1.1.1 -> 3.3.3.3  TCP D=1556 S=35236 Push Ack=3197565794 Seq=3163903240 Len=96 ...

 24 5:02:55.81372  1.1.1.1 -> 3.3.3.3  TCP D=1556 S=35236 Push Ack=3197565794 Seq=3163903240 Len=96 ...

 25 5:04:55.81563  1.1.1.1 -> 3.3.3.3  TCP D=1556 S=35236 Push Ack=3197565794 Seq=3163903240 Len=96 ...

 26 5:06:55.81831  1.1.1.1 -> 3.3.3.3  TCP D=1556 S=35236 Push Ack=3197565794 Seq=3163903240 Len=96 ...

 27 5:08:55.82066  1.1.1.1 -> 3.3.3.3  TCP D=1556 S=35236 Push Ack=3197565794 Seq=3163903240 Len=96 ...

 28 5:10:55.82295  1.1.1.1 -> 3.3.3.3  TCP D=1556 S=35236 Push Ack=3197565794 Seq=3163903240 Len=96 ...
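The timing of the frames above is classic TCP exponential backoff: each retry interval roughly doubles until it reaches a cap near 120 seconds.  A minimal sketch (timestamps copied from frames 13 through 28 above) that computes the intervals:

```python
from datetime import datetime

# Capture timestamps (HH:MM:SS.fffff) of the original send and its
# retransmissions, copied from frames 13-28 above.
stamps = [
    "4:57:29.16138", "4:57:29.36332", "4:57:29.76724", "4:57:30.57508",
    "4:57:32.19106", "4:57:35.42339", "4:57:41.88752", "4:57:54.81578",
    "4:58:20.67228", "4:59:12.38500", "5:00:55.81132", "5:02:55.81372",
    "5:04:55.81563", "5:06:55.81831", "5:08:55.82066", "5:10:55.82295",
]

def parse(ts):
    # The capture has no date component; same-day deltas are all that matter.
    return datetime.strptime(ts, "%H:%M:%S.%f")

# Interval between each retransmission and the previous send.
gaps = [(parse(b) - parse(a)).total_seconds() for a, b in zip(stamps, stamps[1:])]
for g in gaps:
    print(f"{g:8.2f}s")
```

The output shows the interval doubling from about 0.2 seconds up to roughly 103 seconds, after which it stays pinned near the 120-second cap, which is how the master server's TCP stack spends more than 13 minutes retrying before giving up.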

 

At this point it looks like either the packet for Request id=3 is not reaching the client host, or the TCP stack on the client host is not responding appropriately.  But notice the 6,467 seconds of idle time between the Reply to id=2 and the Request id=3.  During that time, the other connection was able to handle all of the CORBA requests.

 

A TCP packet capture from a network device port adjacent to the client host shows the same packet exchange up through the third TCP ACK.  But notice that after the 6,467 seconds of idle time, the TCP packet containing the next CORBA request is never seen.  Some networking component between the two capture points discarded the packet and all of its retransmissions without returning a TCP Reset to the master server host.

 

Note: The clock on the capture host is not synced with the NetBackup hosts, but it is synced with the other capture and may be from a different port on the same device.

 

  1 3:09:41.66764  1.1.1.1 -> 3.3.3.3  TCP D=1556 S=35236 Syn Seq=327149580 Len=0 ...

  2 3:09:41.66826  3.3.3.3 -> 1.1.1.1  TCP D=35236 S=1556 Syn Ack=327149581 Seq=2326494960 Len=0 ...

  3 3:09:41.66854  1.1.1.1 -> 3.3.3.3  TCP D=1556 S=35236 Ack=2326494961 Seq=327149581 Len=0 ...

  4 3:09:41.66880  1.1.1.1 -> 3.3.3.3  TCP D=1556 S=35236 Push Ack=2326494961 Seq=327149581 Len=28 ...

                   > PBX service request, extension=nbFTClient

  5 3:09:41.66907  3.3.3.3 -> 1.1.1.1  TCP D=35236 S=1556 Ack=327149581 Seq=2326494961 Len=0 ...

                   < TCP ACK+Window Update

  6 3:09:41.66968  3.3.3.3 -> 1.1.1.1  TCP D=35236 S=1556 Push Ack=327149609 Seq=2326494961 Len=1 ...

                   < nbftclnt now owns connection

  7 3:09:41.66993  1.1.1.1 -> 3.3.3.3  TCP D=1556 S=35236 Ack=2326494962 Seq=327149609 Len=0 ...

                   > TCP ACK (to TCP PUSH)

  8 3:09:41.67263  1.1.1.1 -> 3.3.3.3  TCP D=1556 S=35236 Push Ack=2326494962 Seq=327149609 Len=153 ...

                   > GIOP Request id=1 _is_a

  9 3:09:41.67515  3.3.3.3 -> 1.1.1.1  TCP D=35236 S=1556 Push Ack=327149762 Seq=2326494962 Len=25 ...

                   < GIOP Reply   id=1   No exception

 10 3:09:41.67853  1.1.1.1 -> 3.3.3.3  TCP D=1556 S=35236 Push Ack=2326494987 Seq=327149762 Len=900 ...

                   > GIOP Request id=2 openPipe

 11 3:09:41.69277  3.3.3.3 -> 1.1.1.1  TCP D=35236 S=1556 Push Ack=327150662 Seq=2326494987 Len=696 ...

                   < GIOP Reply   id=2   No exception

 12 3:09:41.73235  1.1.1.1 -> 3.3.3.3  TCP D=1556 S=35236 Ack=2326495683 Seq=327150662 Len=0 ...

                   > TCP ACK (to TCP PUSH)

 

At 7,330 seconds after the third TCP ACK (and 863 seconds after the Request id=3 packet went missing), a TCP Keepalive packet is sent by the client host TCP stack, indicating that it still considered the connection ESTABLISHED and was awaiting requests.

 

 13 5:11:51.67626  3.3.3.3 -> 1.1.1.1  TCP D=35236 S=1556 Ack=327150661 Seq=2326495682 Len=1 ...

                   < TCP Keepalive
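The intervals quoted above can be checked directly from the capture timestamps (the third TCP ACK at 3:09:41.73235, Request id=3 at 4:57:29.16138, and the Keepalive at 5:11:51.67626); a minimal sketch:

```python
from datetime import datetime

def seconds_between(a, b):
    """Delta in seconds between two same-day capture timestamps (H:MM:SS.fffff)."""
    fmt = "%H:%M:%S.%f"
    return (datetime.strptime(b, fmt) - datetime.strptime(a, fmt)).total_seconds()

third_ack = "3:09:41.73235"   # ACK completing the Reply to id=2
request_3 = "4:57:29.16138"   # GIOP Request id=3 (first send)
keepalive = "5:11:51.67626"   # TCP Keepalive from the client host

print(f"idle before Request id=3:  {seconds_between(third_ack, request_3):7.0f}s")  # ~6467
print(f"third ACK to Keepalive:    {seconds_between(third_ack, keepalive):7.0f}s")  # ~7330
print(f"Request id=3 to Keepalive: {seconds_between(request_3, keepalive):7.0f}s")  # ~863
```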

 

Immediately thereafter, some device in the network returned a TCP Reset instead of a TCP ACK, indicating that the connection is no longer established end-to-end.  The TCP Reset did not come from the TCP stack on the master server host; otherwise it would also have been seen in the capture from the other side of the network.

 

 14 5:11:51.67632  1.1.1.1 -> 3.3.3.3  TCP D=1556 S=35236 Rst Ack=2326495683 Seq=327150661 Len=0 ...

                   > TCP Reset

 

Both captures contain other traffic on other connections until 06:06:00, so both captures were still running.  Because the TCP Keepalive does not appear in the capture on the master server side, it did not make it through the network to the master server.  Some device within the network has dropped the connection and is responding only to the TCP stack on the client host, not to the TCP stack on the master server host.

 

Cause

There is a firewall or other stateful TCP device in the network that dropped the connection between the master server and the client host.  Thereafter, it does not respond to TCP traffic from the master server host on that connection.

 

As a result, the pipe open request from nbftsrvr on the media server reaches FSM on the master server, but does not reach nbftclnt on the client host.  After 5 minutes, nbftsrvr times out and bptm fails the job with status 83.

 

Because the TCP connection is hung waiting for the pipe open request, the pipe close uses a different TCP connection that is not hung, and completes successfully.

 

The initial CORBA connection from the master server to the client host is abandoned after the Client Connect Timeout expires; a new TCP connection is then established, which delivers the pipe open request.  That request fails immediately because the job has already ended.

 

Note: The delay between pipe close and pipe open may vary depending on the timeouts configured.

 

This problem is most prevalent when multiple concurrent jobs are active to the client.  That is because CORBA requests that occur in series use just one TCP connection; an additional connection is established only if a new CORBA request is issued while another request is still outstanding on the existing connection.
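The connection reuse rule described above can be modeled with a tiny sketch (an illustrative model only, not NetBackup code): serial requests share one connection, and an additional connection is opened only when a request arrives while another is still outstanding.

```python
# Illustrative model of the connection-reuse behavior described above.
# Names and structure are assumptions for the sketch, not NetBackup internals.
class ConnectionPool:
    def __init__(self):
        self.connections = []          # one record per open TCP connection

    def acquire(self):
        for conn in self.connections:
            if not conn["busy"]:
                conn["busy"] = True    # serial request: reuse idle connection
                return conn
        conn = {"busy": True}          # all busy: open an additional connection
        self.connections.append(conn)
        return conn

    def release(self, conn):
        conn["busy"] = False

pool = ConnectionPool()
a = pool.acquire()                     # first request opens connection 1
b = pool.acquire()                     # concurrent request opens connection 2
pool.release(a)
pool.release(b)
c = pool.acquire()                     # later serial request reuses a connection
print(len(pool.connections))           # 2 connections remain open
```

This is why a second, rarely used connection comes into existence only under concurrency, then sits idle long enough for a stateful device to silently time it out.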

 

In the situation above, multiple stream backups were running regularly throughout the day.  At some point, the timing of the backups caused a second TCP connection to be established for CORBA communication between the master and client hosts.  It was used for some time before the initial connection was able to resume handling all the subsequent requests.  Sometime thereafter the firewall appears to have dropped the second connection silently, and did not respond when the master server next attempted to reuse it.

 

Solution

This issue can be resolved several ways.

 

The first possibility is to review the configuration of network firewalls, load balancers, switches, and other devices, and make sure that they are not dropping connections unintentionally or prematurely.  If connections must be dropped, ensure that the devices respond appropriately (for example, with a TCP Reset) to subsequent TCP packets on those connections.

 

Alternatively, use a network route that bypasses the device that is dropping the connections.

 

Alternatively, if the connections are being dropped due to an idle timeout, this problem can be avoided by decreasing the TCP keepalive interval on the master server, media server, and client hosts.  NetBackup sets SO_KEEPALIVE on the TCP connections used for CORBA requests.  If the operating systems send TCP keepalives at an interval more frequent than the idle timeout within the network, the connections will not be dropped.  See article 000087737 for the commands to use and parameters to change.
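NetBackup enables SO_KEEPALIVE, but the probe timing comes from operating system settings (the sysctls covered in article 000087737).  The per-socket knobs involved can be illustrated with a short Python sketch; the numeric values here are examples only, chosen to be shorter than a hypothetical 30-minute firewall idle timeout, not a recommendation.

```python
import socket

# Illustrative sketch of the keepalive options involved.  NetBackup sets
# SO_KEEPALIVE; the probe timing normally comes from the OS-wide defaults
# (net.ipv4.tcp_keepalive_* on Linux).  The values below are examples only.
s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
s.setsockopt(socket.SOL_SOCKET, socket.SO_KEEPALIVE, 1)

if hasattr(socket, "TCP_KEEPIDLE"):   # these three options are Linux-specific
    s.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPIDLE, 600)  # idle secs before first probe
    s.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPINTVL, 60)  # secs between probes
    s.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPCNT, 5)     # failed probes before reset

keepalive_on = s.getsockopt(socket.SOL_SOCKET, socket.SO_KEEPALIVE)
print("SO_KEEPALIVE:", keepalive_on)  # non-zero when enabled
s.close()
```

With keepalive probes arriving more often than the device's idle timeout, the connection is never considered idle and is not dropped.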

 


Applies To

Any NetBackup version, including 7.5.0.6 and 7.5.0.7.

 

Any host operating system, including Red Hat 6.4 and HP-UX 11.31 IA64.

 

References

Etrack: 3452968, Etrack: 3519213
