Storage Foundation and High Availability Solutions 7.4.2 Solutions Guide - Windows
- Section I. Introduction
- Introducing Storage Foundation and High Availability Solutions
- Using the Solutions Configuration Center
- SFW best practices for storage
- Section II. Quick Recovery
- Section III. High Availability
- High availability: Overview
- How VCS monitors storage components
- Deploying InfoScale Enterprise for high availability: New installation
- Notes and recommendations for cluster and application configuration
- Configuring disk groups and volumes
- Configuring the cluster using the Cluster Configuration Wizard
- About modifying the cluster configuration
- About installing and configuring the application or server role
- Configuring the service group
- About configuring file shares
- About configuring IIS sites
- About configuring applications using the Application Configuration Wizard
- About configuring the Oracle service group using the wizard
- Modifying the application service groups
- Adding DMP to a clustering configuration
- Section IV. Campus Clustering
- Introduction to campus clustering
- Deploying InfoScale Enterprise for campus cluster
- Notes and recommendations for cluster and application configuration
- Reviewing the configuration
- Configuring the cluster using the Cluster Configuration Wizard
- Creating disk groups and volumes
- Installing the application on cluster nodes
- Section V. Replicated Data Clusters
- Introduction to Replicated Data Clusters
- Deploying Replicated Data Clusters: New application installation
- Notes and recommendations for cluster and application configuration
- Configuring the cluster using the Cluster Configuration Wizard
- Configuring disk groups and volumes
- Installing and configuring the application or server role
- Configuring the service group
- About configuring file shares
- About configuring IIS sites
- About configuring applications using the Application Configuration Wizard
- Configuring a RVG service group for replication
- Configuring the resources in the RVG service group for RDC replication
- Configuring the VMDg or VMNSDg resources for the disk groups
- Configuring the RVG Primary resources
- Adding the nodes from the secondary zone to the RDC
- Verifying the RDC configuration
- Section VI. Disaster Recovery
- Disaster recovery: Overview
- Deploying disaster recovery: New application installation
- Notes and recommendations for cluster and application configuration
- Reviewing the configuration
- About managing disk groups and volumes
- Setting up the secondary site: Configuring SFW HA and setting up a cluster
- Setting up your replication environment
- About configuring disaster recovery with the DR wizard
- Installing and configuring the application or server role (secondary site)
- Configuring replication and global clustering
- Configuring the global cluster option for wide-area failover
- Possible task after creating the DR environment: Adding a new failover node to a Volume Replicator environment
- Maintaining: Normal operations and recovery procedures (Volume Replicator environment)
- Testing fault readiness by running a fire drill
- About the Fire Drill Wizard
- Prerequisites for a fire drill
- Preparing the fire drill configuration
- Deleting the fire drill configuration
- Section VII. Microsoft Clustering Solutions
- Microsoft clustering solutions overview
- Deploying SFW with Microsoft failover clustering
- Tasks for installing InfoScale Foundation or InfoScale Storage for Microsoft failover clustering
- Creating SFW disk groups and volumes
- Implementing a dynamic quorum resource
- Deploying SFW with Microsoft failover clustering in a campus cluster
- Reviewing the configuration
- Establishing a Microsoft failover cluster
- Tasks for installing InfoScale Foundation or InfoScale Storage for Microsoft failover clustering
- Creating disk groups and volumes
- Implementing a dynamic quorum resource
- Installing the application on the cluster nodes
- Deploying SFW and VVR with Microsoft failover clustering
- Part 1: Setting up the cluster on the primary site
- Reviewing the prerequisites and the configuration
- Part 2: Setting up the cluster on the secondary site
- Part 3: Adding the Volume Replicator components for replication
- Part 4: Maintaining normal operations and recovery procedures
- Section VIII. Server Consolidation
- Server consolidation overview
- Server consolidation configurations
- Typical server consolidation configuration
- Server consolidation configuration 1 - many to one
- Server consolidation configuration 2 - many to two: Adding clustering and DMP
- About this configuration
- SFW features that support server consolidation
Campus cluster failure with Microsoft clustering scenarios
This section describes the failure and recovery scenarios for a campus cluster with Microsoft clustering and SFW installed.
For information about the quorum resource and arbitration in Microsoft clustering:
See Microsoft clustering quorum and quorum arbitration.
The following table lists failure situations and the outcomes that occur.
Table: List of failure situations and possible outcomes
Failure situation | Outcome | Comments |
---|---|---|
Application fault. May mean that the services stopped for an application, a NIC failed, or a database table went offline. | Failover | If the services stop because of an application failure, the application automatically fails over to the other site. |
Server failure (Site A). May mean that a power cord was unplugged, a system hang occurred, or another failure caused the system to stop responding. | Failover | Assuming a two-node cluster pair, failing a single node results in a cluster failover. Service is temporarily interrupted for cluster resources that are moved from the failed node to the remaining live node. |
Server failure (Site B). May mean that a power cord was unplugged, a system hang occurred, or another failure caused the system to stop responding. | No interruption of service | Failure of the passive site (Site B) does not interrupt service to the active site (Site A). |
Partial SAN network failure. May mean that the SAN Fibre Channel cables to the Site A or Site B storage were disconnected. | No interruption of service | Assuming that each of the cluster nodes has some type of Dynamic Multi-Pathing (DMP) solution, removing one SAN fiber cable from a single cluster node should not affect any cluster resources running on that node, because the underlying DMP solution should seamlessly handle the SAN fiber path failover. |
Private IP heartbeat network failure. May mean that the private NICs or the connecting network cables failed. | No interruption of service | With the standard two-NIC configuration for a cluster node (one NIC for the public cluster network and one NIC for the private heartbeat network), disabling the NIC for the private heartbeat network should not affect the cluster software or the cluster resources, because the cluster software simply routes the heartbeat packets through the public network. |
Public IP network failure. May mean that the public NIC or LAN network has failed. | Failover | When the public NIC on the active node or the public LAN fails, clients cannot access the active node, and failover occurs. |
Public and private IP or network failure. May mean that the LAN network, including both the private and public NIC connections, has failed. | | The site that owned the quorum resource right before the "network partition" remains the owner of the quorum resource and is the only surviving cluster node. The cluster software running on the other cluster node self-terminates because it has lost the cluster arbitration for the quorum resource. |
Loss of network connection (SAN and LAN), failing both the heartbeat and the connection to storage. May mean that all network and SAN connections are severed, for example if a single pipe is used between buildings for both the Ethernet and the storage. | | The node/site that owned the quorum resource right before the "network partition" remains the owner of the quorum resource and is the only surviving cluster node. The cluster software running on the other cluster node self-terminates because it has lost the cluster arbitration for the quorum resource. By default, the Microsoft clustering service (clussvc) tries to auto-start every minute, so after LAN/SAN communication has been re-established, clussvc auto-starts and can rejoin the existing cluster (see the sketch after this table). |
Storage array failure on Site A or Site B. May mean that a power cord was unplugged, or a storage array failure caused the array to stop responding. | No interruption of service | The campus cluster is divided equally between two sites with one array at each site. Completely failing one storage array should have no effect on the cluster or on any cluster resources that are currently online. However, you cannot move any cluster resources between nodes after this storage failure, because neither node will be able to obtain a majority of disks within the cluster disk group. |
Site A failure (power). Means that all access to Site A, including the server and storage, is lost. | Manual failover | If the failed site contains the cluster node that owned the quorum resource, then the overall cluster is offline and cannot be brought online on the remaining live site without manual intervention. |
Site B failure (power). Means that all access to Site B, including the server and storage, is lost. | No interruption of service | If the failed site did not contain the cluster node that owned the quorum resource, then the cluster remains alive with whatever cluster resources were online on the surviving node right before the site failure. |
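
The scenarios in which both the LAN and the SAN connections are lost depend on the Microsoft Cluster service (clussvc) auto-starting and rejoining the cluster once connectivity is restored. The following Python sketch shows one way to confirm that behavior from the affected node; it only wraps the standard Windows `sc query` and `net start` commands, and the retry count and poll interval are illustrative assumptions, not product settings.

```python
# Minimal sketch, assuming it is run with administrative rights on the
# cluster node that self-terminated: poll the Microsoft Cluster service
# (clussvc) after LAN/SAN connectivity is restored, and start it manually
# if it has not auto-started. Retry count and interval are illustrative.
import subprocess
import time

SERVICE_NAME = "clussvc"   # Microsoft Cluster service
POLL_SECONDS = 60          # matches the default one-minute auto-start retry
MAX_ATTEMPTS = 10          # illustrative; adjust to your environment

def cluster_service_running() -> bool:
    """Return True if 'sc query clussvc' reports the service as RUNNING."""
    result = subprocess.run(
        ["sc", "query", SERVICE_NAME], capture_output=True, text=True
    )
    return "RUNNING" in result.stdout

def wait_for_cluster_service() -> bool:
    """Wait for clussvc to auto-start; try 'net start' as a fallback."""
    for attempt in range(1, MAX_ATTEMPTS + 1):
        if cluster_service_running():
            print(f"{SERVICE_NAME} is running; the node should rejoin the cluster.")
            return True
        print(f"Attempt {attempt}: {SERVICE_NAME} is not running, trying to start it.")
        subprocess.run(["net", "start", SERVICE_NAME], check=False)
        time.sleep(POLL_SECONDS)
    return False

if __name__ == "__main__":
    if not wait_for_cluster_service():
        print(f"{SERVICE_NAME} did not start; check LAN/SAN connectivity and the event logs.")
```

Once the service is running, you can confirm in Failover Cluster Manager that the returning node has rejoined the existing cluster.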