InfoScale™ 9.0 Disaster Recovery Implementation Guide - Linux

Product(s): InfoScale & Storage Foundation (9.0)
Platform: Linux
  1. Section I. Introducing Storage Foundation and High Availability Solutions for disaster recovery
    1. About supported disaster recovery scenarios
      1. About disaster recovery scenarios
      2. About campus cluster configuration
        1. VCS campus cluster requirements
        2. How VCS campus clusters work
        3. Typical VCS campus cluster setup
      3. About replicated data clusters
        1. How VCS replicated data clusters work
      4. About global clusters
        1. How VCS global clusters work
        2. User privileges for cross-cluster operations
        3. VCS global clusters: The building blocks
          1. Visualization of remote cluster objects
          2. About global service groups
          3. About global cluster management
            1. About the wide-area connector process
            2. About the wide-area heartbeat agent
            3. Sample configuration for the wide-area heartbeat agent
          4. About serialization - The Authority attribute
            1. About the Authority and AutoStart attributes
          5. About resiliency and "Right of way"
          6. VCS agents to manage wide-area failover
          7. About the Steward process: Split-brain in two-cluster global clusters
          8. Secure communication in global clusters
      5. Disaster recovery feature support for components in the Veritas InfoScale product suite
      6. Virtualization support for InfoScale 9.0 products in replicated environments
    2. Planning for disaster recovery
      1. Planning for cluster configurations
        1. Planning a campus cluster setup
        2. Planning a replicated data cluster setup
        3. Planning a global cluster setup
      2. Planning for data replication
        1. Data replication options
        2. Data replication considerations
  2. Section II. Implementing campus clusters
    1. Setting up campus clusters for VCS and SFHA
      1. About setting up a campus cluster configuration
        1. Preparing to set up a campus cluster configuration
        2. Configuring I/O fencing to prevent data corruption
        3. Configuring VxVM disk groups for campus cluster configuration
        4. Configuring VCS service group for campus clusters
        5. Setting up campus clusters for VxVM and VCS using Veritas InfoScale Operations Manager
      2. Fire drill in campus clusters
      3. About the DiskGroupSnap agent
      4. About running a fire drill in a campus cluster
        1. Configuring the fire drill service group
        2. Running a successful fire drill in a campus cluster
    2. Setting up campus clusters for SFCFSHA, SFRAC
      1. About setting up a campus cluster for disaster recovery for SFCFSHA or SF Oracle RAC
      2. Preparing to set up a campus cluster in a parallel cluster database environment
      3. Configuring I/O fencing to prevent data corruption
      4. Configuring VxVM disk groups for a campus cluster in a parallel cluster database environment
      5. Configuring VCS service groups for a campus cluster for SFCFSHA and SF Oracle RAC
      6. Tuning guidelines for parallel campus clusters
      7. Best practices for a parallel campus cluster
  3. Section III. Implementing replicated data clusters
    1. Configuring a replicated data cluster using VVR
      1. About setting up a replicated data cluster configuration
        1. About typical replicated data cluster configuration
        2. About setting up replication
        3. Configuring the service groups
        4. Configuring the service group dependencies
      2. About migrating a service group
        1. Switching the service group
      3. Fire drill in replicated data clusters
    2. Configuring a replicated data cluster using third-party replication
      1. About setting up a replicated data cluster configuration using third-party replication
      2. About typical replicated data cluster configuration using third-party replication
      3. About setting up third-party replication
      4. Configuring the service groups for third-party replication
      5. Fire drill in replicated data clusters using third-party replication
  4. Section IV. Implementing global clusters
    1. Configuring global clusters for VCS and SFHA
      1. Installing and Configuring Cluster Server
      2. Setting up VVR replication
        1. About configuring VVR replication
        2. Best practices for setting up replication
        3. Creating a Replicated Data Set
          1. Creating a Primary RVG of an RDS
            1. Prerequisites for creating a Primary RVG of an RDS
            2. Example - Creating a Primary RVG containing a data volume
            3. Example - Creating a Primary RVG containing a volume set
          2. Adding a Secondary to an RDS
            1. Best practices for adding a Secondary to an RDS
            2. Prerequisites for adding a Secondary to an RDS
          3. Changing the replication settings for a Secondary
            1. Setting the mode of replication for a Secondary
              1. Example - Setting the mode of replication to asynchronous for an RDS
              2. Example - Setting the mode of replication to synchronous for an RDS
            2. Setting the latency protection for a Secondary
            3. Setting the SRL overflow protection for a Secondary
            4. Setting the network transport protocol for a Secondary
            5. Setting the packet size for a Secondary
              1. Example - Setting the packet size between the Primary and Secondary
            6. Setting the bandwidth limit for a Secondary
              1. Example: Limiting network bandwidth between the Primary and the Secondary
              2. Example: Disabling Bandwidth Throttling between the Primary and the Secondary
              3. Example: Limiting network bandwidth used by VVR when using full synchronization
        4. Synchronizing the Secondary and starting replication
          1. Methods to synchronize the Secondary
            1. Using the network to synchronize the Secondary
            2. Using block-level tape backup to synchronize the Secondary
            3. Moving disks physically to synchronize the Secondary
          2. Using the automatic synchronization feature
            1. Notes on using automatic synchronization
          3. Example for setting up replication using automatic synchronization
          4. About SmartMove for VVR
          5. About thin storage reclamation and VVR
          6. Determining if a thin reclamation array needs reclamation
        5. Starting replication when the data volumes are zero initialized
          1. Example: Starting replication when the data volumes are zero initialized
      3. Setting up third-party replication
      4. Configuring clusters for global cluster setup
        1. Configuring global cluster components at the primary site
        2. Installing and configuring VCS at the secondary site
        3. Securing communication between the wide-area connectors
        4. Configuring remote cluster objects
        5. Configuring additional heartbeat links (optional)
        6. Configuring the Steward process (optional)
      5. Configuring service groups for global cluster setup
        1. Configuring VCS service group for VVR-based replication
        2. Configuring a service group as a global service group
      6. Fire drill in global clusters
    2. Configuring a global cluster with Storage Foundation Cluster File System High Availability, Storage Foundation for Oracle RAC, or Storage Foundation for Sybase CE
      1. About global clusters
      2. About replication for parallel global clusters using Storage Foundation and High Availability (SFHA) Solutions
      3. About setting up a global cluster environment for parallel clusters
      4. Configuring the primary site
      5. Configuring the secondary site
        1. Configuring the Sybase ASE CE cluster on the secondary site
      6. Setting up replication between parallel global cluster sites
      7. Testing a parallel global cluster configuration
    3. Configuring global clusters with VVR and Storage Foundation Cluster File System High Availability, Storage Foundation for Oracle RAC, or Storage Foundation for Sybase CE
      1. About configuring a parallel global cluster using Volume Replicator (VVR) for replication
      2. Setting up replication on the primary site using VVR
        1. Creating the data and SRL volumes on the primary site
        2. Setting up the Replicated Volume Group on the primary site
      3. Setting up replication on the secondary site using VVR
        1. Creating the data and SRL volumes on the secondary site
        2. Editing the /etc/vx/vras/.rdg files
        3. Setting up IP addresses for RLINKs on each cluster
        4. Setting up the disk group on secondary site for replication
      4. Starting replication of the primary site database volume to the secondary site using VVR
      5. Configuring Cluster Server to replicate the database volume using VVR
        1. Modifying the Cluster Server (VCS) configuration on the primary site
        2. Modifying the VCS configuration on the secondary site
        3. Configuring the Sybase ASE CE cluster on the secondary site
      6. Replication use cases for global parallel clusters
  5. Section V. Reference
    1. Appendix A. Sample configuration files
      1. Sample Storage Foundation for Oracle RAC configuration files
        1. sfrac02_main.cf file
        2. sfrac07_main.cf and sfrac08_main.cf files
        3. sfrac09_main.cf and sfrac10_main.cf files
        4. sfrac11_main.cf file
        5. sfrac12_main.cf and sfrac13_main.cf files
        6. Sample fire drill service group configuration
      2. About sample main.cf files for Storage Foundation (SF) for Oracle RAC
        1. Sample main.cf for Oracle 10g for CVM/VVR primary site
        2. Sample main.cf for Oracle 10g for CVM/VVR secondary site
      3. About sample main.cf files for Storage Foundation (SF) for Sybase ASE CE
        1. Sample main.cf for a basic Sybase ASE CE cluster configuration under VCS control with shared mount point on CFS for Sybase binary installation
        2. Sample main.cf for a basic Sybase ASE CE cluster configuration with local mount point on VxFS for Sybase binary installation
        3. Sample main.cf for a primary CVM VVR site
        4. Sample main.cf for a secondary CVM VVR site

How VCS campus clusters work

This topic describes how VCS works with VxVM to provide high availability in a campus cluster environment.

In a campus cluster setup, VxVM automatically mirrors volumes across sites. To enhance read performance, VxVM reads from the plexes at the local site where the application is running. VxVM writes to plexes at both sites.

In the event of a storage failure at a site, VxVM detaches all the disks at the failed site from the disk group to maintain data consistency. When the failed storage comes back online, VxVM automatically reattaches the site to the disk group and recovers the plexes.

See the Storage Foundation Cluster File System High Availability Administrator's Guide for more information.
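The following is a minimal, hedged sketch of how a disk group is typically made site-aware so that VxVM can mirror volumes across sites in this way. The disk group name (appdg) and the site names (siteA, siteB) are assumptions for illustration; see the Storage Foundation Cluster File System High Availability Administrator's Guide for the authoritative procedure, including how to tag the disks themselves with site names.

# Register each host's site name with VxVM (run the appropriate command on each host)
vxdctl set site=siteA        # hosts at site A
vxdctl set site=siteB        # hosts at site B

# Register both sites with the disk group and enable site consistency so that
# volumes are mirrored across sites and reads are serviced from local plexes
vxdg -g appdg addsite siteA
vxdg -g appdg addsite siteB
vxdg -g appdg set siteconsistent=on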

When service group or system faults occur, VCS fails over service groups based on the values you set for the cluster attribute SiteAware and the service group attribute AutoFailOver.

For a campus cluster setup, you must define sites and add systems to the sites that you defined. A system can belong to only one site. Site definitions are uniform across VCS, Arctera InfoScale Operations Manager, and VxVM. You can define site dependencies to restrict connected applications so that they fail over within the same site.

You can define sites by using:

  • Arctera InfoScale Operations Manager

    For more information on configuring sites, see the latest version of the Arctera InfoScale Operations Manager User guide.

Depending on the value of the AutoFailOver attribute, VCS failover behavior is as follows:

  • 0 - VCS does not fail over the service group.

  • 1 - VCS fails over the service group to another suitable node. By default, the AutoFailOver attribute value is set to 1.

  • 2 - VCS fails over the service group if another suitable node exists in the same site. Otherwise, VCS waits for administrator intervention to initiate the service group failover to a suitable node in the other site. This configuration requires the HA/DR license to be enabled.

Arctera recommends that you set the value of the AutoFailOver attribute to 2.

A sample definition of these cluster and site attributes in the VCS main.cf file is as follows:

cluster VCS_CLUS (
        PreferredFencingPolicy = Site
        SiteAware = 1
        )
site MTV (
        SystemList = { sys1, sys2 }
        )
site SFO (
        Preference = 2
        SystemList = { sys3, sys4 }
        )
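
As an alternative to editing the main.cf file directly, the cluster-level attributes and the AutoFailOver value of a service group can typically be modified at run time from the VCS command line. The following commands are an illustrative sketch only; failover_group is the sample group name used in the configuration that follows.

haconf -makerw                                 # open the configuration for writing
haclus -modify SiteAware 1                     # make the cluster site-aware
haclus -modify PreferredFencingPolicy Site     # prefer fencing decisions by site
hagrp -modify failover_group AutoFailOver 2    # keep automatic failover within the site
haconf -dump -makero                           # save and close the configuration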

A sample configuration for hybrid_group with AutoFailOver = 1 and for failover_group with AutoFailOver = 2 is as follows:

group hybrid_group (
    Parallel = 2
    SystemList = { sys1 = 0, sys2 = 1, sys3 = 2, sys4 = 3 }
)

group failover_group (
    AutoFailOver = 2
    SystemList = { sys1 = 0, sys2 = 1, sys3 = 2, sys4 = 3 }
)
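
To verify the effective settings for a group at run time, you can query its attributes with the hagrp command; for example (using the sample group name from above):

hagrp -display failover_group -attribute AutoFailOver
hagrp -display failover_group -attribute SystemList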

Table: Failure scenarios in campus cluster lists the possible failure scenarios and describes how a VCS campus cluster recovers from these failures.

Table: Failure scenarios in campus cluster

Failure

Description and recovery

Node failure

  • A node in a site fails.

    If the value of the AutoFailOver attribute is set to 1, VCS fails over the service group to another system within the same site (as defined for the cluster) or within the same system zone (as defined by the SystemZones attribute of the service group or by Arctera InfoScale Operations Manager). A sketch of the SystemZones attribute follows this table.

  • All nodes in a site fail.

If the value of the AutoFailOver attribute is set to 0, VCS requires administrator intervention to initiate a failover in both cases of node failure.

Application failure

The behavior is similar to that of a node failure.

Storage failure - one or more disks at a site fail

VCS does not fail over the service group when such a storage failure occurs.

VxVM detaches the site from the disk group if any volume in that disk group does not have at least one valid plex at the site where the disks failed.

VxVM does not detach the site from the disk group in the following cases:

  • None of the plexes are configured on the failed disks.

  • Some of the plexes are configured on the failed disks, and at least one plex for a volume survives at each site.

If only some of the disks that failed come online and if the vxrelocd daemon is running, VxVM relocates the remaining failed disks to any available disks. Then, VxVM automatically reattaches the site to the disk group and resynchronizes the plexes to recover the volumes.

If all the disks that failed come online, VxVM automatically reattaches the site to the disk group and resynchronizes the plexes to recover the volumes.

Storage failure - all disks at both sites fail

VCS acts based on the DiskGroup agent's PanicSystemOnDGLoss attribute value.

See the Cluster Server Bundled Agents Reference Guide for more information.

Site failure

All nodes and storage at a site fail.

Depending on the value of the AutoFailOver attribute, VCS fails over the service group as follows:

  • If the value is set to 1, VCS fails over the service group to a suitable system at the other site.

  • If the value is set to 2, VCS requires administrator intervention to initiate the service group failover to a system in the other site.

Because the storage at the failed site is inaccessible, VCS imports the disk group in the application service group with all devices at the failed site marked as NODEVICE.

When the storage at the failed site comes online, VxVM automatically reattaches the site to the disk group and resynchronizes the plexes to recover the volumes.

Network failure (LLT interconnect failure)

Nodes at each site lose connectivity to the nodes at the other site.

The failure of all private interconnects between the nodes can result in a split-brain scenario and cause data corruption.

Review the details on other possible causes of split brain and how I/O fencing protects shared data from corruption.

Arctera recommends that you configure I/O fencing to prevent data corruption in campus clusters.

When the cluster attribute PreferredFencingPolicy is set as Site, the fencing driver gives preference to the node with higher site priority during the race for coordination points. VCS uses the site-level attribute Preference to determine the node weight.

Network failure (LLT and storage interconnect failure)

Nodes at each site lose connectivity to the storage and to the nodes at the other site.

Arctera recommends that you configure I/O fencing to prevent split brain and serial split brain conditions.

  • If I/O fencing is configured:

    The site that does not win the race triggers a system panic.

    When you restore the network connectivity, VxVM detects the storage at the failed site, reattaches the site to the disk group, and resynchronizes the plexes to recover the volumes.

  • If I/O fencing is not configured:

    If the application service group was online at site A during such a failure, the application service group remains online at the same site. Because the storage is inaccessible, VxVM detaches the disks at the failed site from the disk group. At site B, where the application service group is offline, VCS brings the application service group online and imports the disk group with all devices at site A marked as NODEVICE. As a result, the application service group is online at both sites and each site uses its local storage. This causes inconsistent data copies and leads to a site-wide split brain.

    When you restore the network connectivity between sites, a serial split brain may exist.

    See the Storage Foundation Administrator's Guide for details to recover from a serial split brain condition.
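
The node failure scenario in the table above refers to the SystemZones service group attribute. As a minimal sketch, with the group and system names chosen only for illustration, a service group that places sys1 and sys2 in one zone, places sys3 and sys4 in another, and restricts automatic failover to the local zone could be defined in the main.cf file as follows:

group app_group (
    SystemList = { sys1 = 0, sys2 = 1, sys3 = 2, sys4 = 3 }
    SystemZones = { sys1 = 0, sys2 = 0, sys3 = 1, sys4 = 1 }
    AutoFailOver = 2
)

With this configuration, VCS first looks for a suitable failover target in the same zone. If no such system exists, VCS waits for administrator intervention before the group is failed over to the other zone.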