Storage Foundation and High Availability 8.0.2 Solutions Microsoft Clustering Solutions Guide for Microsoft SQL Server - Windows

Last Published:
Product(s): InfoScale & Storage Foundation (8.0.2)
Platform: Windows
  1. Introducing SFW solutions for a Microsoft cluster
    1. About Microsoft clustering solutions with SFW
    2. Advantages of using SFW in a Microsoft cluster
    3. About high availability clusters
    4. About campus clusters
    5. About disaster recovery clusters
  2. Planning for deploying SQL Server with SFW in a Microsoft cluster
    1. InfoScale requirements for Microsoft clustering solutions
    2. Planning your SQL Server high availability configuration
      1. Sample high availability configuration for SQL Server with SFW
      2. Configuring the quorum device for high availability
    3. Planning your campus cluster configuration
      1. Microsoft campus cluster failure scenarios
      2. Microsoft cluster quorum and quorum arbitration
        1. Quorum
        2. Cluster ownership of the quorum resource
        3. The vxclus utility
    4. Planning your disaster recovery configuration
      1. Sample disaster recovery configuration for SQL Server with SFW and Volume Replicator
  3. Workflows for deploying SQL Server with SFW in a Microsoft cluster
    1. Workflow for a high availability (HA) configuration
    2. Workflow for a campus cluster configuration
      1. Campus cluster: Connecting the two nodes
    3. Workflow for a disaster recovery configuration
    4. Using the Solutions Configuration Center workflow
    5. Configuring the storage hardware and network
  4. Configuring SFW storage
    1. Tasks for configuring InfoScale Storage
    2. Planning for SFW cluster disk groups and volumes
      1. Sample SQL Server high-availability cluster storage configuration
      2. Sample campus cluster storage configuration
      3. Sample SQL Server disaster recovery storage configuration
    3. Considerations when creating disk groups and volumes for a campus cluster
    4. Considerations when creating volumes for a DR configuration using Volume Replicator replication
    5. Viewing the available disk storage
    6. Creating dynamic cluster disk groups
    7. Adding disks to campus cluster sites
    8. Creating dynamic volumes for high availability clusters
    9. Creating dynamic volumes for campus clusters
  5. Implementing a dynamic mirrored quorum resource
    1. Tasks for implementing a dynamic mirrored quorum resource
    2. Creating a dynamic cluster disk group and a mirrored volume for the quorum resource
    3. Adding a Volume Manager Disk Group resource for the quorum
    4. Changing the quorum resource to a dynamic mirrored quorum resource
  6. Installing SQL Server and configuring resources
    1. Tasks for installing and configuring SQL Server
    2. Creating the resource group for the SQL Server instance
    3. Prerequisites for installing SQL Server
    4. Installing SQL Server in an InfoScale Storage environment
    5. Dependency graph for SQL Server
    6. Verifying the SQL Server group in the Microsoft cluster
  7. Configuring disaster recovery
    1. Tasks for configuring the secondary site for disaster recovery for SQL Server
    2. Verifying the primary site configuration
    3. Creating a parallel environment for SQL Server on the secondary site
    4. Volume Replicator components overview
    5. Setting up security for Volume Replicator
    6. Creating resources for Volume Replicator
    7. Configuring Volume Replicator: Setting up an RDS
      1. Prerequisites for setting up the RDS
      2. Creating a Replicated Data Set (RDS)
    8. Creating the RVG resource
    9. Setting the SQL Server resource dependency on the RVG resource
    10. Normal Volume Replicator operations and recovery procedures
      1. Monitoring the status of the replication
      2. Performing planned migration
      3. Replication recovery procedures
        1. Bringing up the application on the secondary host
        2. Restoring the primary host
  8. Appendix A. Configure InfoScale Storage in an existing Microsoft Failover Cluster
    1. Configuring InfoScale Storage in an existing Microsoft Failover Cluster

Microsoft campus cluster failure scenarios

Different failure and recovery scenarios can occur with a Microsoft campus cluster and InfoScale Storage installed.

The site scenarios that can occur when there is a cluster server failure include the following:

  • If the site that does not own the quorum volume goes offline, the quorum and data volumes remain online at the other site, and other cluster resources stay online or move to that site. Storage Foundation lets the owning cluster node remain online with 50% ownership of the disks in the quorum group.

  • If the site that owns the quorum volume goes offline, the remaining site cannot gain control of the quorum volume, because it cannot reserve a majority of the disks in the quorum group. This is a safeguard that prevents multiple nodes from bringing online members of a cluster disk group to which they have access. The sketch below illustrates this majority rule.
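To make the majority rule concrete, here is a minimal Python sketch of the arbitration logic implied by these two cases. It assumes a two-site campus cluster with the quorum disk group split evenly across the sites; the function name, disk counts, and the simplified gain-versus-retain distinction are illustrative assumptions, not the actual behavior of the vxclus utility or the Microsoft cluster service.

```python
# Illustrative model (an assumption, not SFW or Microsoft code) of quorum
# disk arbitration in a two-site campus cluster with disks split evenly.

def can_bring_quorum_online(currently_owns: bool, reservable_disks: int,
                            total_disks: int) -> bool:
    """Decide whether a node may keep or take the quorum disk group online."""
    if currently_owns:
        # The current owner may stay online with as little as half of the
        # disks (the 50% ownership case described above).
        return 2 * reservable_disks >= total_disks
    # A node that does not already own the quorum must reserve a strict
    # majority of the disks to take control.
    return 2 * reservable_disks > total_disks

TOTAL = 4  # for example, two quorum-group disks at each site

# The owning site loses sight of the remote array: it keeps the quorum
# online with its local 2 of 4 disks.
print(can_bring_quorum_online(True, 2, TOTAL))   # True

# The other site tries to take over with only its local 2 of 4 disks:
# it cannot reserve a majority, so the quorum stays offline there.
print(can_bring_quorum_online(False, 2, TOTAL))  # False
```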

Manual failover of a cluster between two sites should be performed only after coordination between the two sites to ensure that the primary server has in fact failed. If the primary server is still active and you manually import a cluster disk group containing the cluster quorum to the secondary (failover) server, a split-brain situation occurs. There may be data loss if the split-brain situation occurs because each plex of the mirrored volume may be updated independently when the same disk group is imported on both nodes.
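The data-loss risk can be pictured with a small, purely illustrative Python sketch. Nothing here corresponds to an SFW API; the class, node names, and write operations are assumptions that simply show how the two plexes of a mirrored volume can diverge once each node writes only to the storage it can still reach.

```python
# Purely illustrative model of split-brain divergence; not SFW behavior or API.

class MirroredVolume:
    """A volume mirrored across two sites; both plexes stay identical as long
    as only one node has the cluster disk group imported."""

    def __init__(self) -> None:
        self.plex_site_a: list[str] = []
        self.plex_site_b: list[str] = []

    def write(self, from_node: str, data: str) -> None:
        # During a split brain, each node can reach only its local array,
        # so its writes land on the local plex alone.
        if from_node == "primary":
            self.plex_site_a.append(data)
        else:  # disk group imported on the secondary while the primary is alive
            self.plex_site_b.append(data)

    def plexes_consistent(self) -> bool:
        return self.plex_site_a == self.plex_site_b


vol = MirroredVolume()
vol.write("primary", "txn-1")    # the primary has not actually failed
vol.write("secondary", "txn-2")  # premature manual import at the failover site
print(vol.plexes_consistent())   # False: the mirrors have diverged, and one
                                 # set of updates is lost when they resync
```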

For additional details on the manual failover scenario, see the following topic:

See Microsoft cluster quorum and quorum arbitration.

The following table lists failure situations and the outcomes that occur; a short sketch after the table summarizes the quorum-ownership rule behind the site-failure rows.

Table: List of failure situations and possible outcomes

| Failure situation | Outcome | Comments |
| --- | --- | --- |
| Application fault. May mean the services stopped for an application, a NIC failed, or a database table went offline. | Failover | If the services stop because of an application failure, the application automatically fails over to the other site. |
| Server failure (Site A). May mean that a power cord was unplugged, a system hang occurred, or another failure caused the system to stop responding. | Failover | Assuming a two-node cluster pair, failing a single node results in a cluster failover. There is a temporary service interruption for cluster resources that are moved from the failed node to the remaining live node. |
| Server failure (Site B). May mean that a power cord was unplugged, a system hang occurred, or another failure caused the system to stop responding. | No interruption of service. | Failure of the passive site (Site B) does not interrupt service to the active site (Site A). |
| Partial SAN network failure. May mean that SAN Fibre Channel cables were disconnected to Site A or Site B storage. | No interruption of service. | Assuming that each cluster node has a Dynamic Multi-Pathing (DMP) solution, removing one SAN fibre cable from a single cluster node should not affect any cluster resources running on that node, because the DMP solution handles the SAN path failover seamlessly. |
| Private IP heartbeat network failure. May mean that the private NICs or the connecting network cables failed. | No interruption of service. | With the standard two-NIC configuration for a cluster node (one NIC for the public cluster network and one for the private heartbeat network), disabling the private heartbeat NIC should not affect the cluster software or the cluster resources, because the cluster software routes the heartbeat packets through the public network. |
| Public IP network failure. May mean that the public NIC or LAN network has failed. | Failover. Mirroring continues. | When the public NIC on the active node or the public LAN fails, clients cannot access the active node, and failover occurs. |
| Public and private IP or network failure. May mean that the LAN network, including both private and public NIC connections, has failed. | No interruption of service. No public LAN access. Mirroring continues. | The site that owned the quorum resource right before the network partition remains the owner of the quorum resource and is the only surviving cluster node. The cluster software running on the other cluster node self-terminates because it has lost the cluster arbitration for the quorum resource. |
| Loss of network connection (SAN and LAN), failing both the heartbeat and the connection to storage. May mean that all network and SAN connections are severed, for example if a single pipe is used between buildings for both Ethernet and storage. | No interruption of service. Disks on the same node are functioning. Mirroring is not working. | The node/site that owned the quorum resource right before the network partition remains the owner and is the only surviving cluster node. The cluster software running on the other cluster node self-terminates because it has lost the cluster arbitration for the quorum resource. By default, the Microsoft cluster service (clussvc) attempts to restart every minute, so after LAN/SAN communication is re-established, it restarts automatically and rejoins the existing cluster. |
| Storage array failure on Site A or Site B. May mean that a power cord was unplugged, or a storage array failure caused the array to stop responding. | No interruption of service. Disks on the same node are functioning. Mirroring is not working. | The campus cluster is divided equally between the two sites, with one array at each site. Completely failing one storage array should not affect the cluster or any cluster resources that are currently online. However, you cannot move any cluster resources between nodes after this storage failure, because neither node can obtain a majority of the disks within the cluster disk group. |
| Site A failure (power). Means that all access to Site A, including server and storage, is lost. | Manual failover. | If the failed site contains the cluster node that owned the quorum resource, the overall cluster is offline and cannot be brought online on the remaining live site without manual intervention. |
| Site B failure (power). Means that all access to Site B, including server and storage, is lost. | No interruption of service. Disks on the same node are functioning. Mirroring is not working. | If the failed site did not contain the cluster node that owned the quorum resource, the cluster remains alive with whatever cluster resources were online on that node right before the site failure. |
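Several rows above hinge on the same question: did the failed site own the quorum resource? The following Python fragment is a rough, hypothetical condensation of that rule as the table states it; the enum, messages, and function name are assumptions and are not part of any SFW or Microsoft clustering tooling.

```python
# Hypothetical condensation of the site-failure rows in the table above.

from enum import Enum

class SiteFailureOutcome(Enum):
    NO_INTERRUPTION = ("Surviving site keeps its resources online; "
                       "mirroring is degraded until the site returns.")
    MANUAL_FAILOVER = ("Cluster is offline; an administrator must intervene "
                       "at the surviving site to bring resources online.")

def outcome_of_site_failure(failed_site_owned_quorum: bool) -> SiteFailureOutcome:
    """Outcome when an entire site (servers and storage) is lost."""
    if failed_site_owned_quorum:
        # 'Site A failure (power)' row: the remaining site cannot reserve a
        # majority of the quorum-group disks, so manual intervention is needed.
        return SiteFailureOutcome.MANUAL_FAILOVER
    # 'Site B failure (power)' row: the quorum owner survives and carries on.
    return SiteFailureOutcome.NO_INTERRUPTION

print(outcome_of_site_failure(True).value)
print(outcome_of_site_failure(False).value)
```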