Cluster Server 7.4.1 Administrator's Guide - Linux

Last Published: 2019-10-17

Product(s): InfoScale & Storage Foundation (7.4.1)

Platform: Linux

VCS performance consideration when a resource fails

The time it takes to detect a resource fault or failure depends on the MonitorInterval attribute for the resource type. When a resource faults, the next monitor cycle detects it. The agent may not declare the resource as faulted immediately if the ToleranceLimit attribute is set to a non-zero value. The resource is declared faulted only when the monitor function has reported offline more times than the number set in ToleranceLimit. However, if the resource remains online for the interval designated in the ConfInterval attribute, previous reports of offline are not counted against ToleranceLimit.
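The ToleranceLimit and ConfInterval behavior described above can be sketched as a small counter. This is an illustrative model only, not the agent framework's code: the class name, the method name, and the ConfInterval value of 600 seconds used as a default here are assumptions made for the example.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class ToleranceTracker:
    """Hypothetical model of how offline reports count against ToleranceLimit."""
    tolerance_limit: int = 0       # ToleranceLimit attribute
    conf_interval: float = 600.0   # ConfInterval in seconds (assumed example value)
    offline_reports: int = 0
    online_since: Optional[float] = None

    def record(self, now: float, online: bool) -> bool:
        """Record one monitor result; return True if the resource is declared faulted."""
        if online:
            if self.online_since is None:
                self.online_since = now
            elif now - self.online_since >= self.conf_interval:
                # Online for a full ConfInterval: earlier offline reports are forgiven.
                self.offline_reports = 0
            return False
        # Offline report: the online streak ends and the report is counted.
        self.online_since = None
        self.offline_reports += 1
        return self.offline_reports > self.tolerance_limit
```

With `tolerance_limit=0`, the first offline report returns True. With `tolerance_limit=2`, the third offline report does, unless the resource stayed online for a full ConfInterval in between.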

When the agent determines that the resource is faulted, it calls the clean function (if implemented) to ensure that the resource is completely offline. The monitor cycle that follows clean verifies that the resource is offline. If RestartLimit is non-zero, the agent then tries to restart the resource up to the number of times set in the RestartLimit attribute before it gives up and informs HAD that the resource is faulted. However, if the resource remains online for the interval designated in ConfInterval, earlier attempts to restart are not counted against RestartLimit.
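The sequence in the paragraph above can be written down as a small decision function. This is a simplified sketch, not the agent's actual implementation: the function name, the action labels, and the assumption that a verifying monitor runs even when clean is not implemented are all hypothetical.

```python
def fault_actions(restart_limit: int, restarts_used: int, has_clean: bool = True) -> list:
    """Return the ordered actions taken once the agent decides a resource is faulted.

    restart_limit  -- value of the RestartLimit attribute
    restarts_used  -- restart attempts already counted against RestartLimit
    has_clean      -- whether the agent implements a clean function
    """
    actions = []
    if has_clean:
        actions.append("clean")              # bring the resource fully offline
    actions.append("monitor")                # verify that the resource is offline
    if restarts_used < restart_limit:
        actions.append("restart")            # try to bring the resource online again
    else:
        actions.append("report_faulted_to_HAD")
    return actions
```

For example, with RestartLimit set to 0 the function goes straight from the verifying monitor to reporting the fault, which matches the case where no local restart is attempted.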

In most cases, ToleranceLimit is 0. The time it takes to detect a resource failure is the time it takes the agent monitor to detect failure, plus the time to clean up the resource if the clean function is implemented. Therefore, the time it takes to detect failure depends on the MonitorInterval, the efficiency of the monitor and clean (if implemented) functions, and the ToleranceLimit (if set).
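As a rough worked example of this dependency, the sketch below estimates an upper bound on detection time. It is a simplification under stated assumptions: the function and parameter names are hypothetical, the fault is assumed to occur just after a monitor cycle completes, and a ToleranceLimit of n is assumed to require n + 1 offline reports spaced one MonitorInterval apart.

```python
def worst_case_detection_seconds(
    monitor_interval: float,
    monitor_runtime: float,
    clean_runtime: float = 0.0,
    tolerance_limit: int = 0,
) -> float:
    """Estimate the longest time from fault to the end of clean, in seconds."""
    # One more offline report than ToleranceLimit is needed to declare the fault.
    offline_reports_needed = tolerance_limit + 1
    # Each report can be up to one MonitorInterval after the previous event.
    waiting = offline_reports_needed * monitor_interval
    # Add the runtime of the final monitor and of clean (0 if not implemented).
    return waiting + monitor_runtime + clean_runtime


# Example values only: 60 s interval, 2 s monitor, 5 s clean.
print(worst_case_detection_seconds(60, 2, 5))                     # ToleranceLimit = 0
print(worst_case_detection_seconds(60, 2, 5, tolerance_limit=2))  # ToleranceLimit = 2
```

The estimate shows why a shorter MonitorInterval or a faster monitor function shortens detection, and why each increment of ToleranceLimit adds roughly one more MonitorInterval.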

In some cases, the failed resource may hang and may also cause the monitor to hang. For example, if the database server is hung and the monitor tries to query it, the monitor also hangs. If the monitor function is hung, the agent eventually kills the thread running the function. By default, the agent times out the monitor function after 60 seconds. You can adjust this by changing the MonitorTimeout attribute. The agent retries the monitor function after the MonitorInterval. If the monitor function times out consecutively for the number of times designated in the FaultOnMonitorTimeouts attribute, the agent treats the resource as faulted and calls clean, if implemented.

The default value of FaultOnMonitorTimeouts is 4, and it can be changed for each resource type. A high value of this attribute delays detection of a fault if the resource is hung. If the resource is hung and causes the monitor function to hang, the time to detect it depends on MonitorTimeout, FaultOnMonitorTimeouts, and the efficiency of monitor and clean (if implemented).
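The effect of these attributes on a hung resource can be estimated with a short calculation. This is a back-of-the-envelope sketch, not a documented formula: the function name is hypothetical, the MonitorInterval of 60 seconds is an assumed example value, and the model assumes that each retry starts one full MonitorInterval after the previous timeout.

```python
def hung_monitor_detection_seconds(
    monitor_timeout: float = 60.0,        # MonitorTimeout (default stated in the text)
    fault_on_monitor_timeouts: int = 4,   # FaultOnMonitorTimeouts (default stated in the text)
    monitor_interval: float = 60.0,       # MonitorInterval (assumed example value)
    clean_runtime: float = 0.0,           # time taken by clean, 0 if not implemented
) -> float:
    """Estimate seconds from the first hung monitor call to the end of clean."""
    timeouts = fault_on_monitor_timeouts
    # Every hung monitor call runs for the full MonitorTimeout before it is killed.
    time_in_monitor = timeouts * monitor_timeout
    # Assumption: the agent waits one MonitorInterval between consecutive attempts.
    time_between_attempts = (timeouts - 1) * monitor_interval
    return time_in_monitor + time_between_attempts + clean_runtime


print(hung_monitor_detection_seconds())                             # values from the text
print(hung_monitor_detection_seconds(fault_on_monitor_timeouts=2))  # lower threshold
```

Under these assumptions, lowering FaultOnMonitorTimeouts or MonitorTimeout shortens detection of a hung resource, at the cost of a higher chance that a slow but healthy monitor call is treated as a fault.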