Please enter search query.
Search <product_name> all support & community content...
Article: 100001028
Last Published: 2010-01-08
Ratings: 1 0
Product(s): NetBackup & Alta Data Protection
Problem
Backup generates Status 40 in Activity Monitor and Status 150 on the client.Solution
Overview:
A backup job runs for two hours andthen fails with Status 40 in Activity Monitor. Examination of Media Serverand Client logs reveals a Status 150.
Troubleshooting:
Turn uplogging for the NetBackup Job Manager (nbjm = OID 117) on the MasterServer:
vxlogcfg -a -p NB -o 117 -s "DebugLevel=6"
vxlogcfg -a -p NB -o117 -s "DiagnosticLevel=6"
Create logging folders on the Media Server forbpcd, bptm and bpbrm at VERBOSE =5:
...\NetBackup\logs\bpcd
...\NetBackup\logs\bptm
...\NetBackup\logs\bpbrm
Createlogging folder on the Client for bpbkar at VERBOSE =5:
...\NetBackup\logs\bpbkar
Rerun the backup job and examinethe logs.
From the Master Server:
Convert the nbjm raw log totext output:
vxlogview -d all -i 117 > nbjm.txt
Search nbjm.txt forthe JobID as seen in the Activity Monitor and see what leads up to the jobcancellation:
In this example, everything looks fine in nbjm up until12:44:
1/19/2010 12:44:07.771 [Debug] NB 51216 nbjm 117 PID:880TID:3876 File ID:117 [ jobid=13510 parentid=13510] 6[BackupJob::updateJobInstance] --- EXITING --- : obj = 0000000002CAD3B0(BackupJob.cpp:2313)
1/19/2010 12:44:07.771 [Application] NB 51216nbjm 117 PID:880 TID:3876 File ID:117 [reqid=-1263919625] [Info] V-117-49received WRITE DONE message: Msg: type = 39 BRMWroteMsg:backupid=tnaspr04_1263920501, Kbytes written=20032, End of Data=0, Kbytes persec=3787.86, copy=0
Then exactly 2 hours later nbjm terminates the jobwith Status 40, as seen in the Activity Monitor:
1/19/2010 14:44:07.949 [Debug] NB 51216 nbjm 117 PID:880 TID:2712 File ID:117[reqid=-1263919625] 4 [BRMComm: :terminateAllJobs] (0000000002C906F0) jobid=13510 (BRMComm.cpp:2014)
1/19/2010 14:44:07.949 [Debug]NB 51216 nbjm 117 PID:880 TID:2712 File ID:117 [No context] 4[JobNotification::~JobNotification] (0000000002CF3710) notify and closejobid=13510 status=40(BRMComm.cpp:2891)
From the MediaServer:
bptm is writing data, when it gets the Terminate Eventfrom nbjm:
14:07:33.229 [ 5288.3248] <2> write_data: Total Kbytestransferred 8994368
14:47:03.995 [ 5288.3248] <2>mm_check_child_write: Received Terminate Event
14:47:04.026[ 5288.3248] <2> wait_for_sigcld: waiting for child 4112 to exit,timeout is 300
bpcd records the inbound terminate event from theMaster Server's IP Address for bptm PID 5288:
14:46:58.073 [2152.2468]<2> logconnections: BPCD ACCEPT FROM 10.12.12.69.52596 TO10.14.0.101.13724
...
14:47:03.995 [2152.2468] <2> bpcd main:BPCD_SEND_SIGNAL_TO_PID
14:47:03.995 [2152.2468] <4> bpcd main: setterminate event for pid 5288
bpbrm records the terminaterequest from bptm:
14:48:29.011 [2016.3512] <2> bpbrm check_for_terminate: exit_status = 150
14:48:29.011 [2016.3512]<2> bpbrm check_for_terminate: child exit abnormal - status =150
14:48:29.011 [2016.3512] <2> bpbrm kill_child_process_Ex:start
14:48:29.011 [2016.3512] <2> inform_client_of_status: INF -Server status = 150
From the Client:
bpbkar records aterminate from bpbrm:
2:49:04.066 PM: [3828.6584] <4>tar_backup::readServerMessage: INF - keepalive message received
2:49:04.066PM: [3828.6584] <4> bpio::read_string: INF - read non-blocking message oflength 25
2:49:04.066 PM: [3828.6584] <4>tar_backup:: readServerMessage: INF - 'INF - Server status = 150'received
Explanation:
In this example, the idle sockettimeout was set to one hour on the firewall between the Master Server and theMedia Server and the socket was dropped at that point while the backup was stillin progress.
Note that the Windows OS on the Master Server has adefault TCP KeepAliveTime of 2 hours. The KeepAliveTime parameter is designed tosend a KeepAlive over an established but idle socket at the specified timeinterval specifically to prevent the socket from being dropped.
Thesequence of events in the above example is a follows:
1. The socket betweennbjm and bpbrm is established and then goes idle while the backup is inprogress. The bpbrm process on the media server will not try to use thesocket until the data transfer completes. The nbjm process will post aread system call and wait for the exit status to arrive.
2. One hour into thebackup job the firewall drops the socket. The OS TCP stacks on both hostswill be unaware and not inform the application programs.
3. A second hourpasses, and the Microsoft TCP stack tries to send a TCP KeepAlive over thesocket to the remote TCP stack.
4. The KeepAlive fails because the socket isno longer connected through the firewall and the read system call returns tonbjm with an error indication.
5. nbjm then terminates bptm.
6. Beforeexiting, bptm notifies bpbrm which then also terminatesbpbkar.
Solution:
In order to correct this behavior andallow the job to run to completion, the TCP KeepAliveTime interval on the MasterServer must be reduced to a value lower then the idle socket timeout valueconfigured on the firewall. In this example, the firewall was confirmed tohave a 60 minute idle socket timeout. The Windows TCP KeepAliveTime wasreduced to 40 minutes and the problem was resolved.
See this MicrosoftArticle for information on the TCP/IP Configuration Parameter"KeepAliveTime":
http://technet.microsoft.com/en-us/library/cc739819(WS.10).aspx
A backup job runs for two hours andthen fails with Status 40 in Activity Monitor. Examination of Media Serverand Client logs reveals a Status 150.
Troubleshooting:
Turn uplogging for the NetBackup Job Manager (nbjm = OID 117) on the MasterServer:
vxlogcfg -a -p NB -o 117 -s "DebugLevel=6"
vxlogcfg -a -p NB -o117 -s "DiagnosticLevel=6"
Create logging folders on the Media Server forbpcd, bptm and bpbrm at VERBOSE =5:
...\NetBackup\logs\bpcd
...\NetBackup\logs\bptm
...\NetBackup\logs\bpbrm
Createlogging folder on the Client for bpbkar at VERBOSE =5:
...\NetBackup\logs\bpbkar
Rerun the backup job and examinethe logs.
From the Master Server:
Convert the nbjm raw log totext output:
vxlogview -d all -i 117 > nbjm.txt
Search nbjm.txt forthe JobID as seen in the Activity Monitor and see what leads up to the jobcancellation:
In this example, everything looks fine in nbjm up until12:44:
1/19/2010 12:44:07.771 [Debug] NB 51216 nbjm 117 PID:880TID:3876 File ID:117 [ jobid=13510 parentid=13510] 6[BackupJob::updateJobInstance] --- EXITING --- : obj = 0000000002CAD3B0(BackupJob.cpp:2313)
1/19/2010 12:44:07.771 [Application] NB 51216nbjm 117 PID:880 TID:3876 File ID:117 [reqid=-1263919625] [Info] V-117-49received WRITE DONE message: Msg: type = 39 BRMWroteMsg:backupid=tnaspr04_1263920501, Kbytes written=20032, End of Data=0, Kbytes persec=3787.86, copy=0
Then exactly 2 hours later nbjm terminates the jobwith Status 40, as seen in the Activity Monitor:
1/19/2010 14:44:07.949 [Debug] NB 51216 nbjm 117 PID:880 TID:2712 File ID:117[reqid=-1263919625] 4 [BRMComm: :terminateAllJobs] (0000000002C906F0) jobid=13510 (BRMComm.cpp:2014)
1/19/2010 14:44:07.949 [Debug]NB 51216 nbjm 117 PID:880 TID:2712 File ID:117 [No context] 4[JobNotification::~JobNotification] (0000000002CF3710) notify and closejobid=13510 status=40(BRMComm.cpp:2891)
From the MediaServer:
bptm is writing data, when it gets the Terminate Eventfrom nbjm:
14:07:33.229 [ 5288.3248] <2> write_data: Total Kbytestransferred 8994368
14:47:03.995 [ 5288.3248] <2>mm_check_child_write: Received Terminate Event
14:47:04.026[ 5288.3248] <2> wait_for_sigcld: waiting for child 4112 to exit,timeout is 300
bpcd records the inbound terminate event from theMaster Server's IP Address for bptm PID 5288:
14:46:58.073 [2152.2468]<2> logconnections: BPCD ACCEPT FROM 10.12.12.69.52596 TO10.14.0.101.13724
...
14:47:03.995 [2152.2468] <2> bpcd main:BPCD_SEND_SIGNAL_TO_PID
14:47:03.995 [2152.2468] <4> bpcd main: setterminate event for pid 5288
bpbrm records the terminaterequest from bptm:
14:48:29.011 [2016.3512] <2> bpbrm check_for_terminate: exit_status = 150
14:48:29.011 [2016.3512]<2> bpbrm check_for_terminate: child exit abnormal - status =150
14:48:29.011 [2016.3512] <2> bpbrm kill_child_process_Ex:start
14:48:29.011 [2016.3512] <2> inform_client_of_status: INF -Server status = 150
From the Client:
bpbkar records aterminate from bpbrm:
2:49:04.066 PM: [3828.6584] <4>tar_backup::readServerMessage: INF - keepalive message received
2:49:04.066PM: [3828.6584] <4> bpio::read_string: INF - read non-blocking message oflength 25
2:49:04.066 PM: [3828.6584] <4>tar_backup:: readServerMessage: INF - 'INF - Server status = 150'received
Explanation:
In this example, the idle sockettimeout was set to one hour on the firewall between the Master Server and theMedia Server and the socket was dropped at that point while the backup was stillin progress.
Note that the Windows OS on the Master Server has adefault TCP KeepAliveTime of 2 hours. The KeepAliveTime parameter is designed tosend a KeepAlive over an established but idle socket at the specified timeinterval specifically to prevent the socket from being dropped.
Thesequence of events in the above example is a follows:
1. The socket betweennbjm and bpbrm is established and then goes idle while the backup is inprogress. The bpbrm process on the media server will not try to use thesocket until the data transfer completes. The nbjm process will post aread system call and wait for the exit status to arrive.
2. One hour into thebackup job the firewall drops the socket. The OS TCP stacks on both hostswill be unaware and not inform the application programs.
3. A second hourpasses, and the Microsoft TCP stack tries to send a TCP KeepAlive over thesocket to the remote TCP stack.
4. The KeepAlive fails because the socket isno longer connected through the firewall and the read system call returns tonbjm with an error indication.
5. nbjm then terminates bptm.
6. Beforeexiting, bptm notifies bpbrm which then also terminatesbpbkar.
Solution:
In order to correct this behavior andallow the job to run to completion, the TCP KeepAliveTime interval on the MasterServer must be reduced to a value lower then the idle socket timeout valueconfigured on the firewall. In this example, the firewall was confirmed tohave a 60 minute idle socket timeout. The Windows TCP KeepAliveTime wasreduced to 40 minutes and the problem was resolved.
See this MicrosoftArticle for information on the TCP/IP Configuration Parameter"KeepAliveTime":
http://technet.microsoft.com/en-us/library/cc739819(WS.10).aspx
Key:Tcpip\Parameters\KeepAliveTime
Value Type: REG_DWORD-time inmilliseconds
Valid Range:1-0xFFFFFFFF
Default: 7,200,000 (twohours)
Description: The parametercontrols how often TCP attempts to verify that an idle connection is stillintact by sending a keep-alive packet. If the remote system is still reachableand functioning, it acknowledges the keep-alive transmission. Keep-alive packetsare not sent by default. This feature may be enabled on a connection by anapplication.
Note: A reboot is required to utilize theupdated KeepAliveTime registry key setting.