Description
This article explains how NetBackup uses tape drives. Below we describe how to troubleshoot and resolve some common tape drive issues.
NetBackup does not write data directly to a tape drive; it uses operating system (OS) commands. For example, on Solaris, NetBackup uses the st tape driver to write the data to the tape. NetBackup can define the block size, but the commands used are at the OS level.
The SCSI pass-through driver (the sg driver on Solaris) allows SCSI commands to be passed directly to the drive. For example, the test-unit-ready SCSI command is used to mount a tape.
Most drive or tape issues have a cause outside of NetBackup. Thus, we recommend starting to troubleshoot at the OS and hardware level. Always check the OS system or event logs for errors.
First, reboot the library or drives reporting an issue. Then, if the issue remains, reboot the associated servers. Many of the errors mentioned in this document can be cleared this way. Even if a reboot does not clear the issue, it at least rules out a transient device or server state as the cause.
Remember that, even when NetBackup reports an error, it does not mean that NetBackup is the cause.
Common drive issues addressed in this document are:
- Scan command issues
- TAPE_ALERT
- ASC/ASCQ
- Missing Path
- Positioning errors
- Read/Write errors
- I/O Errors
- External event has caused rewind
- Tapes not reaching expected capacity
- Tapes being incorrectly marked as 'read only'
- Library Inventory Issues
- Robot load issue - "Error bptm error requesting media TpErrno = Robot operation failed"
- Missing drives, or drives disappearing and reappearing
- Tapes failing to mount in NetBackup, but visible and usable by operating system commands
- Issues moving tapes to/from slots or drives
- Issues with Cartridge memory
- Cleaning tape
- Status 277
- Phase 2 Import fails - Windows only
Scan Command
Symptoms:
The scan command shows no devices at all, or devices vanish and re-appear when the command is run many times.
Whilst the scan command is supplied with NetBackup, it does not interact with NetBackup at all. When scan is run, it sends OS SCSI commands to the devices, and the device's direct output is shown. There are no settings, tuning, or troubleshooting that can be done on the scan command itself from within NetBackup.
Check that the OS can both see, and send commands to the tape drives. It is not enough just to see them in Device Manager (Windows) or cfgadm (Solaris). These utilities don't show if the devices are correctly configured and able to both send and receive information.
For example, devices may be visible to the OS, but SAN issues cause communication errors. This will cause the scan command to fail.
If devices permanently vanish, it may be worth reconfiguring the passthrough driver (only used on Unix). If the OS set-up of the drives has not changed this is unlikely to be the issue. Even so, it can be worth removing this as a possible cause. See the device configuration guide for information on how to do this. If the scan command shows devices vanishing and reappearing, then the passthrough driver is not the cause.
If shared drives keep vanishing and re-appearing, check that no backups are running. The SCSI reservation of a drive may stop the drive from showing in the output of the scan command.
If the above hasn't solved the issue, then the cause lies in the SAN, firmware, hardware, or vendor drivers. Check the SAN infrastructure (e.g. switches), HBAs (including configuration files), and the physical drive or library. Your OS or SAN administrators or hardware vendors should be able to help to find the root cause.
Known Issues:
(1) Some 6Gb SAS HBAs are not compatible with the mpt_sas driver. See Oracle's article: https://docs.oracle.com/cd/E19253-01/821-0382/821-0382.pdf or 100003058.
(2) There is a known issue on Linux where 'scan' output shows "-" in the Device Name.
Device Name: "-"
Passthru Name: "/dev/sg4"
This occurs when an HBA that has not been used with NetBackup before (a SAS HBA) has a different SYSFS path on Linux. Media Manager uses string matching when it traverses the SYSFS file system looking for paths, and this did not work correctly with the new HBA. This is addressed in ET 3954293.
TapeAlert/Tape Alert
Symptoms:
A TapeAlert message is seen in the NetBackup GUI or logs.
Example:
Oct 11 08:59:31 media bptm[3771]: [ID 228150 daemon.warning] TapeAlert Code: 0x03, Type: Warning, Flag: HARD ERROR, from drive TLD0_LTO4_DRIVE1 (index 4), Media Id R0TP01
A TapeAlert message or alert is due to a tape drive or library hardware or media event. These alerts are kept on the tape drive or library. NetBackup queries the tape device or library for TapeAlerts and then displays them. TapeAlerts are shown in the NetBackup bptm log. (See 100000622 for more information on TapeAlerts and codes)
While NetBackup displays TapeAlerts, they are due to a tape drive or library hardware issue, not a NetBackup issue. Check the Event Viewer or system log for errors. Your hardware vendor should be able to help to find the root cause.
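When many TapeAlerts need to be reviewed, the fields can be pulled out of the log lines with a short script. The following is a minimal sketch that assumes the message layout matches the single example line shown above; the function name parse_tapealert is hypothetical, and other NetBackup versions may format the message differently.

```python
import re

# Pattern modelled on the one example line in this article (assumption:
# other releases or platforms may lay the message out differently).
TAPEALERT_RE = re.compile(
    r"TapeAlert Code: (?P<code>0x[0-9a-fA-F]+), "
    r"Type: (?P<type>[^,]+), "
    r"Flag: (?P<flag>[^,]+), "
    r"from drive (?P<drive>\S+) \(index (?P<index>\d+)\), "
    r"Media Id (?P<media_id>\S+)"
)


def parse_tapealert(line):
    """Return the TapeAlert fields of a log line as a dict, or None."""
    match = TAPEALERT_RE.search(line)
    if match is None:
        return None
    fields = match.groupdict()
    fields["code"] = int(fields["code"], 16)   # "0x03" -> 3
    fields["index"] = int(fields["index"])
    return fields


sample = (
    "Oct 11 08:59:31 media bptm[3771]: [ID 228150 daemon.warning] "
    "TapeAlert Code: 0x03, Type: Warning, Flag: HARD ERROR, "
    "from drive TLD0_LTO4_DRIVE1 (index 4), Media Id R0TP01"
)
print(parse_tapealert(sample))
```

Grouping the parsed results by drive or by media ID can help show whether a particular drive or a particular cartridge keeps raising alerts.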
ASC/ASCQ
Symptoms:
Sense Key messages are seen in the logs or console.
In this example, robtest was failing to load a tape into a drive.
Initiating MOVE_MEDIUM from address 1000 to 500
move_medium failed, CHECK CONDITION
sense key = 0x5, asc = 0x30, ascq = 0x0, INCOMPATIBLE MEDIUM INSTALLED
This output can be broken down as follows:
Sense Key 0x5 = Illegal Request
ASC/ASCQ 0x30/0x00 = Incompatible Medium Installed
Like TapeAlerts, SCSI Sense Keys and ASC/ASCQ codes are produced by the device hardware, not by NetBackup.
A reboot of the drive (not a soft reset) may clear ASC/ASCQ errors.
More information on these values can be found at https://www.t10.org. Your hardware vendor should be able to help to find the root cause.
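The breakdown shown above can be sketched as a small lookup. This is an illustration only: the tables hold just the codes that appear in this article (the full lists are maintained by T10), and the function name decode_sense is hypothetical.

```python
import re

# Small excerpts of the T10 tables. Only codes quoted in this article
# are listed, so any other value is reported as "not listed".
SENSE_KEYS = {
    0x4: "HARDWARE ERROR",
    0x5: "ILLEGAL REQUEST",
}
ASC_ASCQ = {
    (0x15, 0x01): "MECHANICAL POSITIONING ERROR",
    (0x30, 0x00): "INCOMPATIBLE MEDIUM INSTALLED",
}

SENSE_RE = re.compile(
    r"sense key = (0x[0-9a-fA-F]+), "
    r"asc = (0x[0-9a-fA-F]+), "
    r"ascq = (0x[0-9a-fA-F]+)"
)


def decode_sense(line):
    """Return (sense key text, ASC/ASCQ text) for a robtest-style line."""
    match = SENSE_RE.search(line)
    if match is None:
        return None
    key, asc, ascq = (int(value, 16) for value in match.groups())
    return (
        SENSE_KEYS.get(key, "not listed"),
        ASC_ASCQ.get((asc, ascq), "not listed"),
    )


line = "sense key = 0x5, asc = 0x30, ascq = 0x0, INCOMPATIBLE MEDIUM INSTALLED"
print(decode_sense(line))
```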
Note
If NetBackup Key Management Service (KMS) is in use, drives may report ASC/ASCQ errors relating to encryption. Although the drive sends the message, in this case only, the cause may be the KMS service.
Missing Path
Symptom:
Missing path is seen in the drive details, or jobs fail with missing path errors.
Enter option:
Id DriveName Type Residence
Drive Path Status
****************************************************************
1 Drive1 hcart2 TLD(3) DRIVE=1
MISSING_PATH:4:0:0:2:SG11028122 DOWN
Missing path means that the OS cannot see the drives. The devices will have also vanished from the scan output, as the scan command works at the OS level.
NetBackup is not the cause but, when the issue is fixed, the paths to the devices may have changed. If this is the case, the devices will need to be deleted and the new paths entered in NetBackup. If the devices come back with the same OS paths, then no more action should be needed.
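To pick out affected drives from a long drive listing, the output can be scanned for path lines that begin with MISSING_PATH. The sketch below assumes the two-line-per-drive layout of the example above (a row starting with the numeric Id, followed by the path row); the function name is hypothetical and the fields after MISSING_PATH are deliberately not interpreted.

```python
def drives_with_missing_path(listing):
    """Return the names of drives whose path row starts with MISSING_PATH."""
    missing = []
    current_drive = None
    for raw in listing.splitlines():
        fields = raw.split()
        if not fields:
            continue
        if fields[0].isdigit() and len(fields) >= 2:
            # "<Id> <DriveName> <Type> <Residence>" row
            current_drive = fields[1]
        elif fields[0].startswith("MISSING_PATH") and current_drive:
            missing.append(current_drive)
    return missing


sample = """\
Id DriveName Type Residence
Drive Path Status
****************************************************************
1 Drive1 hcart2 TLD(3) DRIVE=1
MISSING_PATH:4:0:0:2:SG11028122 DOWN
"""
print(drives_with_missing_path(sample))
```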
Positioning Errors
Symptom:
BPTM reports errors positioning the tape. These can be seen in the logs or the GUI.
Positioning errors occur when the operating system is unable to position, fast-forward or rewind the tape. The error message seen may differ slightly, depending on when the error occurs.
Example 1:
<2> write_data: block position check: actual 62504, expected 31254
Example 2:
1/11/2010 7:50:13 AM - Error bptm (pid=3364) ioctl (MTREW) failed on media id W00229, drive index 0, The I/O bus was reset. (1111) (bptm.c.8039)
NetBackup asks the OS to position the tape at various points in the backup. So, although NetBackup reports the error, it is most commonly caused by:
- Hardware error
- Tape error
- Driver issue
- Firmware issue
As the OS, rather than NetBackup, positions tapes, your hardware vendor should be able to help to find the root cause.
Note:
1. There was a known issue in NetBackup 6.5.6 to 7.0.1 which caused the error:
Error bptm (pid=2164) ioctl (MTWEOF) failed on media id V01497, drive index 0, The physical end of the tape has been reached.
EEB 2182228 resolves this issue. If you see this issue in a later version of NetBackup (after 7.0.1), then the issue is with the firmware or hardware.
2. Between NetBackup 6.5.6 and 7.1.0.3, duplications of MPX backups may result in a positioning error / status 94. To investigate this, Veritas suggests logging a support call and quoting Etrack 2229875.
Read/Write errors
Symptom:
Errors seen with read or write issues.
Example 1:
write_data: cannot write image to media id XXXXXX, drive index #, Data error (cyclic redundancy check).
Example 2:
io_write_block: write error on media id MIR107, drive index 0, writing header block, 1117
Example 3:
Error bptm (pid=5268) cannot read image from media id 500507, drive index 1, err = 234
Reading and writing tapes is done at the OS level. Whilst NetBackup may detect the issue, it is not caused by NetBackup. The cause of read and write errors is usually an issue with the tape drive or media cartridge.
Note:
- McAfee Antivirus software is known to be a possible cause of Status 84 errors on Windows Media Servers.
- Cyclic Redundancy Check errors always indicate faulty hardware.
- MSEO is not compatible with Asynchronous Tapemarks, which were introduced in NetBackup 7.1. Symptoms include write or read errors on tapes encrypted with MSEO. Creating an empty file /usr/openv/netbackup/db/config/DISABLE_IMMEDIATE_WEOF will resolve the issue.
I/O Error
Symptom:
Errors using the term I/O
For example:
11:20:18.246 [8504.5292] <4> write_data: WriteFile failed with: The request could not be performed because of an I/O device error. (1117); bytes written = 65536; size = 0
I/O errors are caused at a hardware level, and are only detected by NetBackup. Your hardware vendor should be able to help to find the root cause.
Known issues:
open failed in io_open I/O error
This exact error message can be caused by mis-configuration of the drives. If the issue remains after this is checked, your hardware vendor should be able to help to find the root cause.
External event has caused rewind
Symptom:
The error will look similar to the following:
<2> io_terminate_tape: block position check: actual 4, expected 5
<16> write_data: FREEZING media id XXXXXX, External event caused rewind during write, all data on media is lost
This issue can be serious and must be looked into, as data can be lost.
NetBackup keeps track of how much data the OS has written to the device. NetBackup then asks the tape device for its position at the end of each write. If this position does not match NetBackup's calculation then the job will fail with a media write error.
A simple block check cannot show if a full rewind happened, but when a position check fails, it is most likely that the reported position is less than the calculated position.
If a full rewind has occurred, the NetBackup header on the tape will have been wiped. The tape cannot be read and the data on the media is lost.
The most common cause of this is a SCSI reset on the SAN, which rewinds the drive while it is being written to. NetBackup cannot see this event; it is only found afterwards, when the block position check fails. NetBackup cannot cause SCSI resets, as tape positioning and read and write operations are controlled by the OS.
If the issue is a position error (rather than a full rewind) the bptm log will contain a message like:
<2> write_data: block position check: actual 62504, expected 31254
<16> write_data: FREEZING media id XXXXXX, too many data blocks written, check tape/driver block size configuration
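The comparison behind the "block position check" messages in this section can be illustrated with a short script that reads the two numbers from a bptm log line and reports which way they differ. This is a sketch only: the labels "behind" and "ahead" are hypothetical names chosen for this example, not terms used by bptm, and the direction of the mismatch is a hint for where to look rather than a diagnosis.

```python
import re

POSITION_RE = re.compile(
    r"block position check: actual (\d+), expected (\d+)"
)


def classify_position_check(line):
    """Compare the drive-reported position with the calculated one."""
    match = POSITION_RE.search(line)
    if match is None:
        return None
    actual = int(match.group(1))
    expected = int(match.group(2))
    if actual == expected:
        return "match"
    if actual < expected:
        # Drive reports fewer blocks than calculated, as in the
        # "External event caused rewind" example in this section.
        return "behind"
    # Drive reports more blocks than calculated, as in the
    # "too many data blocks written" example in this section.
    return "ahead"


print(classify_position_check(
    "<2> io_terminate_tape: block position check: actual 4, expected 5"))
print(classify_position_check(
    "<2> write_data: block position check: actual 62504, expected 31254"))
```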
There can be many causes. These most commonly include:
- Tape driver issue
- Tape drive firmware issue
- SAN fault
- HBA driver or firmware issue, or other fault
- Switch Fault
Your hardware vendor or OS support should be able to help to find the root cause.
Note:
The SCSI reservation is set and held by the HBA. However, NetBackup sends the reserve command through the SCSI pass-through path so this must be configured correctly.
Known Issues:
NDMP
If the drives connect to an NDMP device, check that the SCSI reservation type on the filer matches the type set in NetBackup.
If they don't match, the issue can be fixed by the following steps:
- Open the NetBackup GUI
- Select Host Properties
- Select the Media Type tab
- Check the SCSI reservation set: SPC2 or SCSI persistent
- On the filer, change the type to match.
- Reboot the Library to clear current reservations.
HP-UX 11.31 IA64 / atdd driver
The BPTM block position check fails one block short when using IBM atdd driver 6.0.0.96 on HP-UX 11.31 IA64.
This issue is caused by the atdd driver writing the EOT mark incorrectly.
Veritas has produced a NetBackup 7.0.1 EEB to work around this issue (see 100005443). NetBackup 7.0.1 and later installed on HP-UX 11.31 IA64 requires the atdd driver to be 6.0.2.8 or later. Upgrade to the new atdd driver to fix the issue.
Tapes not reaching capacity
Symptom:
Less data is written to the tape than expected.
For example, only 300 GB of data is written to a 400 GB capacity tape.
NetBackup passes data to the OS, one block at a time, to be written to the tape drive. NetBackup has no concept of tape capacity. In theory, it would keep writing to the same tape "forever".
The tape drive firmware detects when the tape physically passes the logical end-of-tape. It then sets a 'flag' in the tape driver. There is still enough physical space on the tape for the current block to be written, so this completes successfully. NetBackup then tries to send the next block of data but the tape driver refuses, as the 'tape full' flag is set. The driver passes this 'tape full' message to the OS, which then passes it to NetBackup. Only once this has happened will NetBackup request the tape be changed.
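The sequence described above can be modelled with a toy simulation. This is a sketch under stated assumptions: the class SimulatedTape, its attributes, and the block counts are all hypothetical, and real drivers signal end-of-tape through OS-specific error codes rather than a boolean return value.

```python
class SimulatedTape:
    """Toy model of a tape with a logical end-of-tape marker.

    All names and sizes are hypothetical and exist only to illustrate
    the order of events described in the text.
    """

    def __init__(self, blocks_before_leot):
        self.blocks_before_leot = blocks_before_leot
        self.blocks_written = 0
        self.tape_full = False

    def write_block(self):
        """Write one block; return False if the driver refuses it."""
        if self.tape_full:
            return False              # flag already set: write refused
        self.blocks_written += 1      # the current block still fits
        if self.blocks_written >= self.blocks_before_leot:
            self.tape_full = True     # firmware saw logical end-of-tape
        return True


def write_backup(tape, total_blocks):
    """Send blocks until done or refused; return the blocks left over."""
    remaining = total_blocks
    while remaining > 0:
        if not tape.write_block():
            break                     # point at which a new tape is requested
        remaining -= 1
    return remaining


tape = SimulatedTape(blocks_before_leot=3)
left = write_backup(tape, total_blocks=5)
print(tape.blocks_written, left)
```

Note that the writer in this model never asks how much room is left. It only learns that the tape is full when a write is refused, which is why a drive or firmware fault that raises the flag early results in a short tape.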
Common causes of this issue are tape drive firmware, or faulty hardware.
There are no settings in NetBackup that control tape capacity. Your hardware vendor or OS support should be able to help to find the root cause.
Tapes being incorrectly marked as 'read only'
Symptom:
TapeAlert such as:
0x09: 'Cartridge write protected'
NetBackup has no concept of 'read only'. This is set by the tape drive, often by means of a small physical switch on the tape cartridge, and is then reported to NetBackup by the drive's firmware. So, if a tape is shown as 'read only', NetBackup cannot be at fault.
Tape drive firmware issues can make tape media show up as read only when they are not.
Your hardware vendor should be able to help to find the root cause.
Library Inventory Issues
Symptom:
Tapes are missing from a library inventory, even when they are known to be inside.
NetBackup does not directly 'Inventory' a library. It only asks the library to report on which tapes it contains and where they are stored. If NetBackup cannot 'see' a tape it is because the library is not reporting it.
Common library issues are tapes appearing in the wrong slot or tapes/slots not showing at all. This is not caused by NetBackup.
Your hardware vendor should be able to help to find the root cause.
Known issues:
Issues with Virtual I/O slots on IBM 3500 series libraries with ALMS/Virtual I/O:
- Open the IBM web console
- Set "Queued Exports" to 'HIDE'
This should let tapes be moved from the virtual I/O slots, to those within the logical library.
Robotic Library load issue
Symptom:
Library is not able to load tapes.
Error bptm error requesting media TpErrno = Robot operation failed
This error is seen in the bptm log, may be referenced in the .../volmgr/debug log, and possibly also in the OS event log.
A great way to look into this issue is to use the robtest command. See 100022873 for more details. Robtest does not use any NetBackup commands. It only sends OS level SCSI commands to the library, and the output shown comes from the library itself. This makes it a good way to rule out NetBackup from being the source of the issue.
For example, trying to move a tape from slot 86 to drive 2:
m s86 d2
move_medium failed
sense key = 0x4, asc = 0x15, ascq = 0x1, MECHANICAL POSITIONING ERROR
As robtest has only sent a SCSI move request, NetBackup has no part in this.
Your library vendor should be able to help to find the root cause.
Missing drives, or drives disappearing and reappearing
Symptom:
The number of missing devices shown by tpautoconf -report_disc varies when re-run.
Example:
======================= Missing Device (Drive) ======================
Inquiry = "IBM Ultrium 3-SCSI
Serial Number = HM74536FFS
Drive Path = /dev/rmt/0cbn
Drive Name = DRV_F2D3_LTO5
The tpautoconf -report_disc command will show "Missing Device" if a drive is set up in NetBackup but the OS can no longer see it.
SAN issues are the most common reason for this issue, especially if devices vanish and reappear.
NetBackup has no control over the connection between the devices and the OS.
Your SAN admin should be able to help to find the root cause.
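To see whether the same drives are missing on every run, the "Missing Device" blocks can be parsed and the serial numbers compared between runs. The sketch below assumes the block layout of the example above (a banner line containing "Missing Device" followed by "Key = Value" lines); the function name is hypothetical.

```python
def parse_missing_devices(report):
    """Return one dict per 'Missing Device' block in the report text."""
    devices = []
    current = None
    for raw in report.splitlines():
        line = raw.strip()
        if "Missing Device" in line:
            current = {}              # banner line starts a new block
            devices.append(current)
        elif current is not None and "=" in line:
            key, _, value = line.partition("=")
            current[key.strip()] = value.strip().strip('"')
    return devices


sample = """\
======================= Missing Device (Drive) ======================
Inquiry = "IBM Ultrium 3-SCSI
Serial Number = HM74536FFS
Drive Path = /dev/rmt/0cbn
Drive Name = DRV_F2D3_LTO5
"""
devices = parse_missing_devices(sample)
serials = {device["Serial Number"] for device in devices}
print(serials)
```

Saving the set of serial numbers from each run and comparing the sets shows which drives come and go, which is useful evidence to pass to the SAN administrator.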
Tapes failing to mount in NetBackup, but visible and usable by operating system commands
Symptom:
A job hangs on the tape mount, failing with status 98 after some time, but the OS can use the tape.
A few cases have been seen in which tapes are physically loaded into the tape drive. The OS can see them and they respond to OS commands (such as mt and dd), but NetBackup is unable to mount the tape.
At first glance it might seem that NetBackup is at fault. However, in these cases the fault was found to be in the tape drive firmware.
Issues moving tapes to/from slots or drives
Symptom:
This could show in varied messages.
Example:
Auto empty media export request rejected by TLDCD; Cannot move from media access port
Trying to move the tape in port 1 of the CAP to slot 28 using robtest:
m p1 s28
Initiating MOVE_MEDIUM from address 10 to 1027
move_medium failed
sense key = 0x4, asc = 0x40, ascq = 0x1, UNKNOWN ERROR, KEY: 0x04, ASC: 0x40, ASCQ: 0x01
Check that a robot inventory has been recently run.
NetBackup sends standard SCSI commands to the library. If they do not work there is an issue with the library.
Your library vendor should be able to help to find the root cause.
Issues with Cartridge memory
Symptom:
The library shows errors like:
Description: The memory in the tape cartridge has failed.
Description: The tape drive encountered a problem while loading a tape cartridge.
Description: The tape drive detected an internal hardware problem.
Description: The tape drive has an error which requires the tape cartridge to be ejected for error recovery
LTO tapes contain a small EEPROM chip, known as LTO-CM.
This chip stores useful facts:
- the LTO tape generation
- an error log
- the manufacturer's details
- the position of data on the tape, for fast block positioning.
If the chip fails, errors like the above will be seen.
Your hardware vendor should be able to help to find the root cause.
Cleaning Tape
Symptom:
In NetBackup 7.5, on occasion, a cleaning cycle run by NetBackup will fail.
The symptoms may differ slightly:
A. The tape cannot be unloaded. The /var/adm/messages log will show:
Mar 14 12:49:38 server02 tldcd [19756]: [ID 559682 daemon.notice] TLD(2) closing/unlocking robotic path
Mar 14 12:49:38 server02 tldcd [9536]: [ID 919746 daemon.notice] inquiry() function processing library ADIC Scalar i2000 607A:
Mar 14 12:49:38 server02 tldd [9524]: [ID 583323 daemon.notice] DecodeClean: TLD(2) drive 5, Actual status: Unable to SCSI unload drive
Mar 14 12:49:39 server02 ltid [9497]: [ID 512328 daemon.notice] LTID - received ROBOT MESSAGE, Type=55, LongParam=0, Param1=1, Param2=10
Mar 14 12:49:39 server02 ltid [9497]: [ID 581313 daemon.error] Cleaning for drive 1 failed, status = Unable to SCSI unload drive
Mar 14 12:49:48 server02 bptm [19765]: [ID 946237 daemon.warning] TapeAlert Code: 0x0b, Type: Informational, Flag: CLEANING MEDIA, from drive PER-i2000-Drive5 (index 1), Media Id CLN001
Mar 14 12:49:49 server02 ltid [9497]: [ID 560358 daemon.notice] LTID - Sent ROBOTIC request, Type=3, Param2=1
B. Once the tape drive is cleaned, a new tape is loaded and reloaded repeatedly. The /usr/openv/volmgr/debug/robots log will show:
12:43:53.753 [3016] <4> AddTldLtiReqEntry: Processing ROBOT_CLEAN request...
12:43:53.753 [3016] <5> CleanDrive: TLD(0) Cleaning Tape 4TP012 on drive 5, from slot 41
12:43:53.758 [3424] <4> io_open: Drive Path = /dev/rmt/6cbn ...
12:43:54.018 [3026] <5> tldcd:mount_unmount_drive: Processing MOUNT, TLD(0) drive 5, slot 41, barcode CLN4TP012L4 , vsn 4TP012 ...
12:46:01.764 [3026] <5> tldcd:mount_unmount_drive: Processing UNMOUNT, TLD(0) drive 5, slot 41, barcode CLN4TP012L4 , vsn 4TP012 ...
12:46:12.789 [3016] <5> GetResponseStatus: DecodeClean: TLD(0) drive 5, Actual status: Unable to SCSI unload drive
This issue is caused by the access bit being set to 1 on the tape drive.
The formal resolution for this issue (Etrack 2714791) is included in the following release:
NetBackup 7.5 Maintenance Release 1 (7.5.0.1)
Status 277
Symptoms:
Drives are down in Device Manager. When started, they fail with a NetBackup Status Code 277: the drive is not ready or inoperable.
In general, NetBackup Status Code 277 appears when device changes are made in NetBackup and either the services are not cycled, or the server is not rebooted.
On a Windows server, the two areas to check are Windows Device Manager and the tape driver.
First, check that the drives are shown in Device Manager, that the correct number of drives is detected, and that no drives show errors. Then check the driver details for each tape drive. One at a time, select each tape drive listed, right-click and choose Properties, then select the Driver tab. If the Driver Provider shows as "Unknown", reload the tape driver. After loading a new driver, reboot the server.
If the drives seem OK, stop and restart the NetBackup services on the server. Not doing this is known to cause Status Code 277 errors.
Phase 2 Import fails - Windows only
Phase 2 import fails with the following error:
14:02:24.655 [6060.2952] <2> mpx_read_data: ReadFile returned FALSE, More data is available. (234);bytes = 65536
14:02:24.655 [6060.2952] <2> is_possible_recoverable_error: not attempting error recovery, errno = 234
14:02:24.655 [6060.2952] <2> set_job_details: Tfile (227): LOG 1592020944 16 bptm 6060 cannot read image from media id AA1234, drive index 0, err = 234
14:02:24.655 [6060.2952] <2> send_job_file: job ID 227, ftype = 3 msg len = 93, msg = LOG 1592020944 16 bptm 6060 cannot read image from media id AA1234, drive index 0, err = 234
The NetBackup SIZE_DATA_BUFFERS setting has no effect on tape reads. The block size used to read a tape is always the same as that used to write it.
A known issue exists where the Windows registry value for MaximumSGList is incorrect. The value is located under HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\Services\<SCSI_CARD_NAME>\Parameters\Device\. For more details, request KB article 100016476 from Support.
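As a rough guide to why this registry value matters, the relation commonly documented for Windows SCSI miniport drivers is that the largest single transfer is (MaximumSGList - 1) memory pages. That relation and the 4 KB page size are assumptions here, not statements from this article, and the function names are hypothetical. Confirm the correct value with KB article 100016476 and the HBA vendor before changing the registry.

```python
PAGE_SIZE = 4096  # assumed 4 KB memory page


def max_transfer_bytes(maximum_sg_list):
    """Largest single transfer implied by a MaximumSGList value (assumed relation)."""
    return (maximum_sg_list - 1) * PAGE_SIZE


def sg_list_for_block(block_bytes):
    """Smallest MaximumSGList value that allows a block of this size."""
    pages = -(-block_bytes // PAGE_SIZE)   # ceiling division
    return pages + 1


# A value of 0x11 (17) would limit transfers to 64 KB, the same size as
# the "bytes = 65536" read in the log excerpt above.
print(max_transfer_bytes(0x11))

# A tape written with 256 KB blocks would need at least this value.
print(sg_list_for_block(256 * 1024))
```

Under these assumptions, a tape written with a block size larger than the limit on the reading server cannot be read back in one transfer, which is consistent with the "More data is available. (234)" error in the log.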