Cluster Server 7.4 Agent for EMC SRDF Configuration Guide - Windows

Product(s): InfoScale & Storage Foundation (7.4.1)
Platform: Windows

Failure scenarios in replicated data clusters

The following table lists the failure scenarios in a replicated data cluster configuration, and describes the behavior of VCS and the agent in response to the failure.

Table: Failure scenarios in a replicated data cluster configuration with VCS agent for EMC SRDF

Failure

Description and VCS response

Application failure

Application cannot start successfully on any hosts at the primary site.

VCS response:

  • Causes the service group at the primary site to fault.

  • Does the following based on the AutoFailOver attribute for the faulted service group:

    • 1 - VCS automatically brings the faulted service group online at the secondary site.

    • 2 - You must bring the service group online at the secondary site.

The agent write-enables the devices at the secondary site.

For dynamic RDF devices, the agent does the following if the value of the SwapRoles attribute of the SRDF resource is 1:

  • Swaps the R1/R2 personality of each device in the device group or the consistency group.

  • Restarts replication from the R1 devices at the secondary site to the R2 devices at the primary site.

See Performing failback after a node failure or an application failure.

Host failure

All hosts at the primary site fail.

VCS response:

  • Causes the service group at the primary site to fault.

  • Does the following based on the AutoFailOver attribute for the faulted service group:

    • 1 - VCS automatically brings the faulted service group online at the secondary site.

    • 2 - You must bring the service group online at the secondary site.

The agent write-enables the devices at the secondary site.

For dynamic RDF devices, if the value of the SwapRoles attribute of the SRDF resource is 1, the agent does the following:

  • Swaps the R1/R2 personality of each device in the device group or the consistency group.

  • Restarts replication from the R1 devices at the secondary site to the R2 devices at the primary site.

See Performing failback after a node failure or an application failure.
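The application failure and host failure rows describe the same decision logic, which can be sketched as a small Python model. This is only an illustration of the table, not the agent's code: the function name, the `Response` type, and the parameter names are hypothetical, and only the AutoFailOver values 1 and 2 that the table lists are handled.

```python
from dataclasses import dataclass


@dataclass
class Response:
    # True if VCS brings the group online at the secondary site by itself.
    automatic_failover: bool
    # True if the agent swaps R1/R2 personalities and restarts replication.
    swap_roles: bool


def host_or_app_failure_response(auto_fail_over: int,
                                 dynamic_rdf: bool,
                                 swap_roles_attr: int) -> Response:
    """Model the table rows for application failure and host failure.

    Hypothetical helper: it mirrors the documented behavior only.
    """
    if auto_fail_over not in (1, 2):
        raise ValueError("the table only describes AutoFailOver values 1 and 2")
    return Response(
        # 1: VCS fails the group over automatically; 2: the operator does it.
        automatic_failover=(auto_fail_over == 1),
        # The role swap applies only to dynamic RDF devices with SwapRoles = 1.
        swap_roles=(dynamic_rdf and swap_roles_attr == 1),
    )


print(host_or_app_failure_response(1, True, 1))
print(host_or_app_failure_response(2, False, 1))
```

In both cases the agent still write-enables the devices at the secondary site once the service group is brought online there; the model only captures what varies with the attributes.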

Site failure

All hosts and the storage at the primary site fail.

A site failure puts the devices on the array at the secondary site into the PARTITIONED state.

VCS response:

  • Causes the service group at the primary site to fault.

  • Does the following based on the AutoFailOver attribute for the faulted service group:

    • 1 - VCS automatically brings the faulted service group online at the secondary site.

    • 2 - You must bring the service group online at the secondary site.

Agent response: The agent does the following based on the AutoTakeover attribute of the SRDF resource:

  • 1 - If invalid tracks do not exist, the agent issues the symrdf failover command to make the SRDF devices write-enabled.

  • 0 - The agent faults the SRDF resource.

See Performing failback after a site failure.
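The agent response in the site failure row can be sketched as follows. This is a minimal Python model under stated assumptions: the function name, the return labels, and the device group name are hypothetical, and the case where AutoTakeover is 1 but invalid tracks exist is reported as "not_described" because the table does not state the outcome.

```python
def partitioned_takeover(auto_takeover, invalid_tracks, device_group):
    """Return (decision, command) for devices in the PARTITIONED state.

    Hypothetical helper that mirrors the AutoTakeover rules in the table.
    The command is an argument list that is only built, never executed.
    """
    if auto_takeover not in (0, 1):
        raise ValueError("AutoTakeover is described only for values 0 and 1")
    if auto_takeover == 0:
        # The agent faults the SRDF resource instead of taking over.
        return ("fault_resource", None)
    if not invalid_tracks:
        # The agent issues symrdf failover to write-enable the SRDF devices.
        return ("failover", ["symrdf", "-g", device_group, "failover"])
    # AutoTakeover = 1 with invalid tracks: the table does not say what happens.
    return ("not_described", None)


print(partitioned_takeover(1, False, "srdf_dg"))
print(partitioned_takeover(0, False, "srdf_dg"))
```

When symrdf is run interactively it asks for confirmation before it changes device state, so an operator who runs the same command by hand should expect a prompt.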

Replication link failure

Replication link between the arrays at the two sites fails.

A replication link failure puts the SRDF devices into the PARTITIONED state. When the link is restored, the SRDF devices enter the SUSPENDED state.

VCS response: No action.

Agent response: No action. The agent does not monitor the replication link status and cannot detect link failures.

After the link is restored, you must resynchronize the SRDF devices.

To resynchronize the SRDF devices after the link is restored:

  1. Before you resynchronize the R2 device, split the BCV or target device from the R2 device at the secondary site.

  2. Initiate the resynchronization of the R2 device by using the update action entry point.

  3. After the R1 and R2 devices are in sync, reestablish the mirror relationship between the BCV or target devices and the R2 devices.
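The three steps above can be sketched as an ordered command plan. This is an illustration only, under these assumptions: the protection copies are BCV devices managed with symmir (target devices would need different commands), the commands run on a host at the secondary site, and the function name, device group name, resource name, and system name are hypothetical. The plan is built as data and is not executed.

```python
def link_restore_resync_plan(device_group, srdf_resource, system):
    """Build (description, argv) pairs for resynchronizing after a link failure.

    Hypothetical helper; review each command against your own
    configuration before you run anything.
    """
    return [
        ("Split the BCV devices from the R2 devices",
         ["symmir", "-g", device_group, "split"]),
        ("Start the R2 resynchronization through the agent's update action",
         ["hares", "-action", srdf_resource, "update", "-sys", system]),
        ("Check that the R1 and R2 devices are in sync before continuing",
         ["symrdf", "-g", device_group, "query"]),
        ("Reestablish the BCV mirror relationship",
         ["symmir", "-g", device_group, "establish"]),
    ]


for description, argv in link_restore_resync_plan("srdf_dg", "srdf_res", "node2"):
    print(description + ": " + " ".join(argv))
```

The query step is a manual checkpoint: the last command should be run only after the query output shows that the devices are synchronized.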

If you initiate a failover to the secondary site when resync is in progress, the online function of the EMC SRDF agent waits for the resync to complete and then initiates a takeover of the R2 devices.

Note:

If you did not configure BCV or target devices and a disaster occurs while the resynchronization is in progress, the data at the secondary site becomes inconsistent. Veritas recommends configuring BCV or target devices at both sites.

See Typical EMC SRDF setup in a VCS cluster.

Network failure

The LLT and the replication links between the sites fail.

VCS response:

  • VCS at each site concludes that the nodes at the other site have faulted.

  • Does the following based on the AutoFailOver attribute for the faulted service group:

    • 2 - No action. You must confirm the cause of the network failure from the cluster administrator at the remote site and fix the issue.

    • 1 - VCS brings the service group online at the secondary site, which leads to a cluster-wide split brain. This causes data divergence between the devices on the arrays at the two sites.

      When the network (LLT and replication) connectivity is restored, VCS takes all the service groups offline on one of the sites and restarts itself. This action eliminates the concurrency violation in which the same group is online at both sites.

      After VCS takes the service group offline, you must manually resynchronize the data by using the symrdf establish or the symrdf restore command.

      Note:

      Veritas recommends that the value of the AutoFailOver attribute is set to 2 for all service groups to prevent unintended failovers due to transient network failures.

To resynchronize the data after the network link is restored:

  1. Take the service groups offline at both the sites.

  2. Manually resynchronize the data.

    Depending on the site whose data you want to retain, use the symrdf establish or the symrdf restore command.

  3. Bring the service group online on one of the sites.
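The recovery steps above can be sketched as a command plan. This is a hedged illustration, not a tested procedure: the function name and all group, device group, and system names are hypothetical, the commands are only built and not executed, and the actual device state after a split brain may call for additional symrdf options that are not shown. The sketch relies on symrdf establish copying data from the R1 devices to the R2 devices and symrdf restore copying from R2 to R1.

```python
def split_brain_resync_plan(group, device_group, systems, keep_data_from,
                            online_system):
    """Build the argv lists for recovering from a split brain.

    systems: systems (one per site) where the group may still be online.
    keep_data_from: "R1" or "R2", the side whose data is retained.
    Hypothetical helper that mirrors the three documented steps.
    """
    if keep_data_from not in ("R1", "R2"):
        raise ValueError("keep_data_from must be 'R1' or 'R2'")
    # establish copies R1 -> R2 (keeps R1 data); restore copies R2 -> R1.
    action = "establish" if keep_data_from == "R1" else "restore"
    plan = []
    # Step 1: take the service group offline at both sites.
    for system in systems:
        plan.append(["hagrp", "-offline", group, "-sys", system])
    # Step 2: manually resynchronize the data.
    plan.append(["symrdf", "-g", device_group, action])
    # Step 3: bring the service group online on one of the sites.
    plan.append(["hagrp", "-online", group, "-sys", online_system])
    return plan


for argv in split_brain_resync_plan("app_sg", "srdf_dg",
                                    ["nodeA", "nodeB"], "R1", "nodeA"):
    print(" ".join(argv))
```

Choosing the wrong side overwrites the data you meant to keep, so confirm which site holds the valid data before you run the resynchronization step.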

Agent response: Similar to the agent response for a site failure.

Storage failure

The array at the primary site fails.

A storage failure at the primary site puts the devices on the array at the secondary site into the PARTITIONED state.

VCS response:

  • Causes the service group at the primary site to fault and displays an alert to indicate the fault.

  • Does the following based on the AutoFailOver attribute for the faulted service group:

    • 1 - VCS automatically brings the faulted service group online at the secondary site.

    • 2 - You must bring the service group online at the secondary site.

Agent response: The agent does the following based on the AutoTakeover attribute of the SRDF resource:

  • 1 - If invalid tracks do not exist, the agent issues the symrdf failover command to make the SRDF devices write-enabled.

  • 0 - The agent does not perform failover to the secondary site.