Veritas InfoScale™ 7.3.1 Troubleshooting Guide - Linux

Product(s): InfoScale & Storage Foundation (7.3.1)

Cleaning up the system configuration

After reinstalling VxVM, you must clean up the system configuration.

To clean up the system configuration

  1. After recovering the VxVM configuration, determine which volumes must be restored from backup because the recovered disks do not hold a complete copy of their data. Such volumes are invalid and must be removed, recreated, and restored from backup. If a complete copy of a volume's data is available, the hot-relocation feature can repair the volume, provided that hot-relocation is enabled and the disk group has sufficient spare disk space.

    Establish which VM disks have been removed or reinstalled using the following command:

    # vxdisk list

    This displays a list of system disk devices and the status of these devices. For example, for a reinstalled system with three disks and a reinstalled root disk, the output of the vxdisk list command is similar to this:

    DEVICE    TYPE      DISK      GROUP      STATUS
    sdb       simple    -         -          error
    sdc       simple    disk02    bootdg     online
    sdd       simple    disk03    bootdg     online
    -         -         disk01    bootdg     failed was:sdb

    The display shows that the reinstalled root device, sdb, is not associated with a VM disk and is marked with a status of error. The disks disk02 and disk03 were not involved in the reinstallation and are recognized by VxVM and associated with their devices (sdc and sdd). The former disk01, which was the VM disk associated with the replaced disk device, is no longer associated with the device (sdb).

    If other disks (with volumes or mirrors on them) had been removed or replaced during reinstallation, those disks would also have a disk device listed in error state and a VM disk listed as not associated with a device.

  2. Once you know which disks have been removed or replaced, locate all the mirrors on failed disks using the following command:

    # vxprint -sF "%vname" -e 'sd_disk = "disk"'

    where disk is the name of a disk with a failed status. Be sure to enclose the disk name in quotes in the command. Otherwise, the command returns an error message. The vxprint command returns a list of volumes that have mirrors on the failed disk. Repeat this command for every disk with a failed status.

  3. Check the status of each volume and print volume information using the following command:

    # vxprint -th volume

    where volume is the name of the volume to be examined. The vxprint command displays the status of the volume, its plexes, and the portions of disks that make up those plexes. For example, a volume named v01 with only one plex resides on the reinstalled disk named disk01. The vxprint -th v01 command produces the following output:

    V    NAME      USETYPE   KSTATE     STATE     LENGTH    READPOL     PREFPLEX
    PL   NAME      VOLUME    KSTATE     STATE     LENGTH    LAYOUT      NCOL/WID   MODE
    SD   NAME      PLEX      DISK       DISKOFFS  LENGTH    [COL/]OFF   DEVICE     MODE
    
    v    v01       fsgen     DISABLED   ACTIVE    24000     SELECT      -
    pl   v01-01    v01       DISABLED   NODEVICE  24000     CONCAT      -          RW
    sd   disk01-06 v01-01    disk01     245759    24000     0           sdg        ENA

    The only plex of the volume is shown in the line beginning with pl. The STATE field for the plex named v01-01 is NODEVICE. The plex has space on a disk that has been replaced, removed, or reinstalled. The plex is no longer valid and must be removed.

  4. Because v01-01 was the only plex of the volume, the volume contents are irrecoverable except by restoring the volume from a backup. The volume must also be removed. If a backup copy of the volume exists, you can restore the volume later. Record the volume name and its length; you need both to recreate the volume before you restore it from backup.

    Remove irrecoverable volumes (such as v01) using the following command:

    # vxedit -r rm v01
    
  5. It is possible that only part of a plex is located on the failed disk. If the volume has a striped plex associated with it, the volume is divided between several disks. For example, the volume named v02 has one striped plex striped across three disks, one of which is the reinstalled disk disk01. The vxprint -th v02 command produces the following output:

    V    NAME       USETYPE   KSTATE     STATE      LENGTH    READPOL     PREFPLEX
    PL   NAME       VOLUME    KSTATE     STATE      LENGTH    LAYOUT      NCOL/WID   MODE
    SD   NAME       PLEX      DISK       DISKOFFS   LENGTH    [COL/]OFF   DEVICE     MODE

    v    v02        fsgen     DISABLED   ACTIVE     30720     SELECT      v02-01
    pl   v02-01     v02       DISABLED   NODEVICE   30720     STRIPE      3/128      RW
    sd   disk02-02  v02-01    disk02     424144     10240     0/0         sdi        ENA
    sd   disk01-05  v02-01    disk01     620544     10240     1/0         sdj        DIS
    sd   disk03-01  v02-01    disk03     620544     10240     2/0         sdk        ENA

    The display shows three disks, across which the plex v02-01 is striped (the lines starting with sd represent the stripes). One of the stripe areas is located on a failed disk. This disk is no longer valid, so the plex named v02-01 has a state of NODEVICE. Since this is the only plex of the volume, the volume is invalid and must be removed. If a copy of v02 exists on the backup media, it can be restored later. Keep a record of the volume name and length of any volume you intend to restore from backup.

    Remove invalid volumes (such as v02) using the following command:

    # vxedit -r rm v02
  6. A volume that has one mirror on a failed disk may also have other mirrors on disks that are still valid. In this case, the volume does not need to be restored from backup, since all the data is still available, and recovery can usually be handled by the hot-relocation feature provided that this is enabled.

    If hot-relocation is disabled, you can recover the mirror manually. In this example, the vxprint -th command for a volume with one plex on a failed disk (disk01) and another plex on a valid disk (disk02) produces the following output:

    V    NAME       USETYPE   KSTATE     STATE      LENGTH    READPOL     PREFPLEX
    PL   NAME       VOLUME    KSTATE     STATE      LENGTH    LAYOUT      NCOL/WID   MODE
    SD   NAME       PLEX      DISK       DISKOFFS   LENGTH    [COL/]OFF   DEVICE     MODE

    v    v03        fsgen     DISABLED   ACTIVE     30720     SELECT      -
    pl   v03-01     v03       DISABLED   ACTIVE     30720     CONCAT      -          RW
    sd   disk02-01  v03-01    disk02     620544     30720     0           sdl        ENA
    pl   v03-02     v03       DISABLED   NODEVICE   30720     CONCAT      -          RW
    sd   disk01-04  v03-02    disk01     262144     30720     0           sdm        DIS

    This volume has two plexes, v03-01 and v03-02. The first plex (v03-01) does not use any space on the invalid disk, so it can still be used. The second plex (v03-02) uses space on the invalid disk disk01 and has a state of NODEVICE. Plex v03-02 must be removed. However, the volume still has one valid plex containing valid data. If the volume needs to be mirrored, another plex can be added later. Note the volume name so that you can create the replacement plex later.

    To remove an invalid plex, use the vxplex command to dissociate and then remove the plex from the volume. For example, to dissociate and remove the plex v03-02, use the following command:

    # vxplex -o rm dis v03-02
  7. Once all invalid volumes and plexes have been removed, the disk configuration can be cleaned up. Each disk that was removed, reinstalled, or replaced (as determined from the output of the vxdisk list command) must be removed from the configuration.

    To remove the disk, use the vxdg command. To remove the failed disk disk01, use the following command:

    # vxdg rmdisk disk01

    If the vxdg command returns an error message, invalid mirrors exist.

    Repeat step 1 through step 6 until all invalid volumes and mirrors are removed.

  8. Once all the invalid disks have been removed, the replacement or reinstalled disks can be added to Veritas Volume Manager control. If the root disk was originally under Veritas Volume Manager control or you now wish to put the root disk under Veritas Volume Manager control, add this disk first.

    To add the root disk to Veritas Volume Manager control, use the vxdiskadm command:

    # vxdiskadm

    From the vxdiskadm main menu, select menu item 2 (Encapsulate a disk). Follow the instructions and encapsulate the root disk for the system.

  9. When the encapsulation is complete, reboot the system to multi-user mode.
  10. Once the root disk is encapsulated, any other disks that were replaced should be added using the vxdiskadm command. If the disks were reinstalled during the operating system reinstallation, they should be encapsulated; otherwise, they can be added.
  11. Once all the disks have been added to the system, any volumes that were completely removed as part of the configuration cleanup can be recreated and their contents restored from backup. The volume recreation can be done by using the vxassist command or the graphical user interface.

    For example, to recreate the volumes v01 and v02, use the following commands:

    # vxassist make v01 24000 
    # vxassist make v02 30720 layout=stripe nstripe=3

    Once the volumes are created, they can be restored from backup using normal backup/restore procedures.

  12. Recreate any plexes for volumes that had plexes removed as part of the volume cleanup. To replace the plex removed from volume v03, use the following command:

    # vxassist mirror v03

    Once you have restored the volumes and plexes lost during reinstallation, recovery is complete and your system is configured as it was prior to the failure.
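
Steps 1 and 2 of the procedure can be sketched as a small shell helper. This is a minimal illustration, not part of the product: it assumes that vxdisk list output uses the five-column layout shown in step 1, and failed_disks is a hypothetical function name. The sketch only prints the step 2 query for each failed disk; it does not run any VxVM command.

```shell
#!/bin/sh
# Sketch only: failed_disks is a hypothetical helper, not a VxVM command.
# Assumes the column layout from step 1: DEVICE TYPE DISK GROUP STATUS.
failed_disks() {
    # Skip the header line; print the DISK column where STATUS is "failed".
    awk 'NR > 1 && $5 == "failed" { print $3 }'
}

# Sample output copied from step 1, so the sketch runs without VxVM installed.
sample='DEVICE    TYPE      DISK      GROUP      STATUS
sdb       simple    -         -          error
sdc       simple    disk02    bootdg     online
sdd       simple    disk03    bootdg     online
-         -         disk01    bootdg     failed was:sdb'

# Print (do not execute) the step 2 query for every failed disk.
printf '%s\n' "$sample" | failed_disks | while read -r disk; do
    printf "vxprint -sF \"%%vname\" -e 'sd_disk = \"%s\"'\n" "$disk"
done
```

On a live system the input would come from the command itself, for example vxdisk list | failed_disks.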
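
The decision made in steps 3 through 6 (remove the whole volume when every plex is in the NODEVICE state, otherwise remove only the invalid plexes) can be sketched as follows. The sketch assumes that the input is vxprint -th output for a single volume in the layout shown above; plan_cleanup is a hypothetical function name. It prints the suggested command so that you can review it before running anything.

```shell
#!/bin/sh
# Sketch only: plan_cleanup is a hypothetical helper, not a VxVM command.
# Reads `vxprint -th <volume>` output for ONE volume on standard input.
plan_cleanup() {
    awk '
        BEGIN      { total = 0; nbad = 0 }
        $1 == "v"  { vol = $2 }                      # volume record
        $1 == "pl" { total++                         # plex record
                     if ($5 == "NODEVICE") bad[++nbad] = $2 }
        END {
            if (total > 0 && nbad == total) {
                # No valid plex is left: the volume must be removed.
                print "vxedit -r rm " vol
            } else {
                # A valid plex remains: remove only the invalid plexes.
                for (i = 1; i <= nbad; i++)
                    print "vxplex -o rm dis " bad[i]
            }
        }'
}

# Sample modelled on the v03 example: one valid plex and one NODEVICE plex.
sample='v    v03        fsgen     DISABLED   ACTIVE     30720     SELECT      -
pl   v03-01     v03       DISABLED   ACTIVE     30720     CONCAT      -          RW
sd   disk02-01  v03-01    disk02     620544     30720     0           sdl        ENA
pl   v03-02     v03       DISABLED   NODEVICE   30720     CONCAT      -          RW
sd   disk01-04  v03-02    disk01     262144     30720     0           sdm        DIS'

printf '%s\n' "$sample" | plan_cleanup
```

On a live system the equivalent call would be vxprint -th v03 | plan_cleanup. The uppercase header lines (V, PL, SD) are ignored because the record-type match is case-sensitive.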
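
Steps 4 and 5 ask you to record the name and length of each volume that you remove, and step 11 uses that record to recreate the volume. One way to capture the record is to derive the vxassist make command from the vxprint -th output before you remove the volume. The sketch below is an assumption-laden illustration: recreate_cmd is a hypothetical function name, only the CONCAT and STRIPE layouts from the examples are handled, only the first plex is considered, and the stripe width is ignored, as it is in the step 11 example.

```shell
#!/bin/sh
# Sketch only: recreate_cmd is a hypothetical helper, not a VxVM command.
# Reads `vxprint -th <volume>` output for ONE volume on standard input and
# prints a vxassist command that recreates a volume of the same length.
recreate_cmd() {
    awk '
        $1 == "v" { vol = $2; len = $6 }             # volume name and length
        $1 == "pl" && layout == "" {                 # first plex only
            layout = $7                              # CONCAT or STRIPE
            split($8, nc, "/")                       # NCOL/WID, e.g. 3/128
        }
        END {
            if (vol == "") exit 1                    # no volume record seen
            cmd = "vxassist make " vol " " len
            if (layout == "STRIPE")
                cmd = cmd " layout=stripe nstripe=" nc[1]
            print cmd
        }'
}

# Sample modelled on the v02 example (one plex striped across three columns).
sample='v    v02        fsgen     DISABLED   ACTIVE     30720     SELECT      v02-01
pl   v02-01     v02       DISABLED   NODEVICE   30720     STRIPE      3/128      RW'

printf '%s\n' "$sample" | recreate_cmd
```

Saving the printed command to a file before running vxedit -r rm keeps the name, length, and layout available for step 11.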