Problem
This article outlines how Veritas DMP handles I/O failures on Linux (RedHat).
DMP: I/O JOURNEY
The general flow of I/O through a multi-pathed stack can be summarized in the diagram below:
Applications that run on top of the file system typically send I/O requests to files while databases fire them directly on raw devices. In the former case, based upon the pathname of file, the I/O will then be directed to the relevant file system.
The file system will intercept this request and channel its own I/O to the volume device that it resides on: /dev/vx/[r]dsk/diskgroup_name/volume_name.
The VxVM kernel driver, VxIO, receives the I/O as it passes through the volume layer.
The VxIO driver maintains the volume-plex-subdisk-disk configuration in the kernel.
For a given I/O, VxIO will ascertain from the in-memory volume configuration, which disk(s) the I/O is destined to be serviced by, and sends the I/O (buffer) to the relevant DMP METANODE for that disk device.
The DMP METANODE is a pseudo device located in /dev/vx/[r]dmp and is a representation of the disk with all its paths.
When the I/O is directed at the DMP METANODE device, the DMP kernel module, VxDMP, handles it.
The VxDMP driver creates its own buffer to service the I/O and piggybacks the incoming (VxIO) I/O on this buffer.
DMP will select one of the sub-paths to send the I/O and passes the buffer to the disk driver instance for that path.
The buffer will include the LBOLT value (the number of clock ticks since boot time) at the time of firing the I/O to the disk driver and also the number of times the I/O has been retried.
OS: SCSI layer
DMP will now wait on the I/O as it leaves its domain and enters the underlying SCSI disk driver.
The SCSI disk driver (sd, ssd etc) will now process the I/O and send it to the relevant HBA driver which eventually sends it across to the relevant disk/LUN.
On completion of the I/O, the BUFFER returns all the way up the stack back to the calling system call that initiated the I/O.
DMP: I/O FAILURE HANDLING
When an I/O TIMEOUT occurs the Linux kernel SCSI error handler logic proceeds through a sequence of recovery methods to attempt to recover failing devices or transports while causing as little disruption to other I/O taking place on the system as possible.
The standard recovery levels are executed in order, escalating to the next level whenever a recovery attempt fails or a subsequent SCSI Test Unit Ready (TUR) command fails:
• Abort timed out commands and attempt to bring device online
• Issue SCSI device Reset task management function for each failing device
• Issue SCSI target reset task management function for each failing target
• Issue SCSI bus reset for each failing bus (emulated as a series of port resets for Fibre Channel environments)
• Issue SCSI host bus adapter (HBA) reset
Each level of escalation broadens the scope of the recovery attempt and so increases the number of other I/O requests and devices that may be affected by the recovery action.
By starting with simple command aborts and finally escalating to a full HBA reset the methods are tried in order of increasing cost/disruption.
In a situation where all operations on the external storage time out (for example, due to a failed SAN fabric component that has ceased to pass any traffic or report any error condition), very long delays can be experienced in failing I/O, especially where there are large numbers of devices or targets (since each reset level is repeated for each outstanding command, device, target etc.).
Guidelines for setting the DMP timebound iotimeout attribute with RedHat
KEY POINTS:
DMP will fail the I/O only when the returned I/O reaches DMP.
Once the SCSI layer failure updates DMP, only then will DMP attempt to re-issue the failed I/O against an alternate available path, as long as time still permits before the defined iotimeout is reached or when the number of fixedretries has not been exhausted.
DMP cannot react to a hung I/O in the lower layers unless a response is given back to DMP.
Recovery options for retrying I/O after an error
Recovery option | Possible settings | Description
recoveryoption=timebound | Timebound (iotimeout) | DMP retries a failed I/O request for the specified time in seconds if I/O fails. The failed I/O is retried on other paths where it is most likely to succeed.
recoveryoption=fixedretry | Fixed-Retry (retrycount) | DMP retries a failed I/O request for the specified number of times if I/O fails.
The timebound method and the iotimeout attribute specify the defined amount of time DMP allows for handling an I/O failure.
Note: If the lower layers take too long to respond back to DMP and the iotimeout value is set too low, the DMP timebound iotimeout value can be exhausted before even a single retry of the I/O down an alternate path is attempted.
DMP ERROR-RETRY (FIXED-RETRY RETRYCOUNT)
By default, DMP is configured to retry a failed I/O request up to five minutes (300 seconds) on various paths of a DMP device. The settings for handling I/O request failures can be applied to the paths to an enclosure, array name, or array type.
The recovery options are specified as two types: retrying I/O after an error, or throttling I/O.
You can use the Fixed-Retry method and the retrycount attribute to specify the number of retries to be attempted before DMP returns the failed I/O to the upper layers.
The retrycount limits the number of times that DMP retries an I/O request for the DMPNODE. The retrycount is not specific to a path, but specific to I/O.
DMP will try the I/O on a single path initially and then, if required, retry the I/O on other available paths until the retrycount is exhausted for that particular I/O.
Because DMP must wait for the lower layers on every attempt, retrying the I/O against other available paths for the defined number of retries can be very time consuming.
DMP ERROR-RETRY (TIMEBOUND IOTIMEOUT)
As an alternative to specifying a fixed number of retries, you can use the timebound method and the iotimeout attribute to specify the amount of time DMP allows for handling an I/O request.
If the I/O request does not succeed within that time, DMP fails the I/O request.
The default value of iotimeout is 300 seconds.
Syntax:
# vxdmpadm setattr enclosure <enclosure-name> recoveryoption=timebound iotimeout=160
The iotimeout value for DMP should always be greater than the I/O service time of the underlying operating system layers.
Note: The fixedretry and timebound settings are mutually exclusive.
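The rule that iotimeout must exceed the lower-layer I/O service time can be checked with a few lines of shell. This is only a sketch: the function name check_iotimeout is hypothetical, and the values 160 and 134 are example figures, not recommendations.

```shell
#!/bin/sh
# Hypothetical helper: compare a DMP iotimeout against the I/O service
# time of the lower layers. Both arguments are in seconds.
check_iotimeout() {
    iotimeout=$1
    service_time=$2
    if [ "$iotimeout" -gt "$service_time" ]; then
        echo "OK: iotimeout ${iotimeout}s exceeds service time ${service_time}s"
    else
        echo "WARNING: iotimeout ${iotimeout}s does not exceed service time ${service_time}s"
        return 1
    fi
}

# Example values only (assumptions for illustration).
check_iotimeout 160 134
```

The function returns a non-zero status when the iotimeout is too low, so it can be used as a guard before running vxdmpadm setattr.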
REDHAT: I/O SERVICE TIME LIMIT
By setting an overall limit on the time spent by the lower layers, a more consistent and predictable system behaviour can be enforced.
Once time expires the lower layers can also instruct an immediate HBA reset when faults of this nature occur.
If the I/O request does not succeed within the defined DMP iotimeout, DMP fails the I/O request only after the lower layers have responded back to DMP.
When defining the DMP iotimeout value, it is essential to know how much time will be spent by the lower layers.
Example:
Device timeout (30 seconds by default)
Max of 5 retries (30 seconds intervals)
HBA Reset (14 seconds by default)
Total I/O service timeout of 194 seconds (includes 30 + 150 + 14 seconds HBA reset)
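The arithmetic in the example above can be reproduced with a small shell function. The function name total_service_time is a hypothetical helper, and the model simply assumes that every retry consumes its full interval.

```shell
#!/bin/sh
# Hypothetical helper: device timeout + (retries x retry interval) + HBA reset.
# All arguments are in seconds, except the retry count.
total_service_time() {
    device_timeout=$1
    retries=$2
    retry_interval=$3
    hba_reset=$4
    echo $(( device_timeout + retries * retry_interval + hba_reset ))
}

total_service_time 30 5 30 14   # prints 194
```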
DMP: I/O SERVICE TIMEOUTS WITH FIXED-RETRY
DMP will have to wait for the lower layers to respond each time before retrying I/O down an alternate (enabled) path. The overall SCSI I/O service limit is set to 134 seconds in this instance.
DMP Fixedretry count: 2
SCSI I/O timeout = 134 seconds for the lower layers to fail the I/O
With Fixed-Retry, DMP will retry the I/O 2 times down as many different enabled paths as are available. DMP may reuse a previous path as long as it is in an enabled state.
Therefore, as the SCSI I/O timeout is 134 seconds, DMP may not fail the I/O until after 402 seconds (134 x 3): the initial attempt plus 2 retries.
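The worst case for Fixed-Retry can be expressed as a one-line calculation. This is a simplified model, assuming every attempt takes the full SCSI I/O service time before it is returned to DMP. The function name is hypothetical.

```shell
#!/bin/sh
# Hypothetical helper: worst-case seconds before DMP fails an I/O with Fixed-Retry.
fixedretry_worst_case() {
    retrycount=$1
    service_time=$2
    # One initial attempt plus retrycount retries, each waiting on the lower layers.
    echo $(( (retrycount + 1) * service_time ))
}

fixedretry_worst_case 2 134   # prints 402
```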
DMP: I/O SERVICE TIMEOUTS WITH TIMEBOUND
DMP will have to wait for the lower layers to respond each time before retrying I/O down an alternate (enabled) path. The overall SCSI I/O service limit is set to 134 seconds in this instance.
DMP timebound iotimeout: 180
SCSI I/O timeout = 134 seconds for the lower layers to fail the I/O
With timebound, DMP will retry the I/O down as many different enabled paths as possible until the DMP defined timeout is exhausted.
Therefore, as the SCSI I/O timeout is 134 seconds, DMP may not fail the I/O until after 268 seconds (134 x 2).
The first I/O attempt completes after 134 seconds. As DMP still has 46 seconds left, the I/O is retried down an alternate path. This retry also fails, but only after the DMP timeout of 180 seconds has passed. DMP cannot fail the I/O until after 268 seconds due to the time required by the lower layers.
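The timebound behaviour described above can be modelled as a loop: a new attempt is started only while time remains in the iotimeout window, but each attempt still has to wait for the lower layers. This is a sketch under the assumption that every attempt takes the full service time. The function name is hypothetical.

```shell
#!/bin/sh
# Hypothetical model of timebound handling. Arguments are in seconds.
timebound_worst_case() {
    iotimeout=$1
    service_time=$2
    # Guard against a non-positive service time, which would loop forever.
    [ "$service_time" -gt 0 ] || return 1
    elapsed=0
    attempts=0
    # A new attempt is only issued while the iotimeout window is still open.
    while [ "$elapsed" -lt "$iotimeout" ]; do
        attempts=$(( attempts + 1 ))
        elapsed=$(( elapsed + service_time ))
    done
    echo "$attempts attempts, I/O failed after $elapsed seconds"
}

timebound_worst_case 180 134   # prints "2 attempts, I/O failed after 268 seconds"
```

With an iotimeout lower than the service time (for example 100 against 134), the same model reports a single attempt that fails after 134 seconds, with no retry on an alternate path.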
REDHAT: I/O ERROR HANDLING INSIGHT
When the I/O is sent, the scsi_timeout starts. Once the initial timeout is reached (normally 30 seconds), the eh_deadline starts.
If the SCSI error handling (aborts and device/target/bus resets) does not complete before the eh_deadline timeout expires (when eh_deadline is set to a non-zero value; it is 0 by default), the SCSI layer performs an HBA reset as soon as possible.
The SCSI layer still needs to wait for whatever operation may be in progress, which is why the lpfc_task_mgmt_tmo value is important.
The driver then resets the HBA, and there is a 10 second delay in the error handler during which the HBA hopefully rediscovers its now lost remote ports.
Lastly, there is the sending of TUR (Test Unit Ready) commands to the LUN with timed out commands.
The eh_deadline is the over-all SCSI error handler deadline.
If this expires, the SCSI layer immediately resets the HBA port, and commands are free to retry on another path. The HBA port reset takes approximately 10 seconds.
REDHAT: HBA RESET AND SCSI DEVICE TIMEOUT
The HBA reset depends on the module and the specific HBA in use, but it is generally accepted that 10 seconds is enough.
Therefore the overall SCSI service timeout can be: 30s + 90s + 10s + 2s + 2s = 134s
The scsi_timeout can be set lower than 30 seconds, but consult the OS vendor before doing so. In RHEL6 it was reduced to 30 seconds to allow for faster error detection and hence faster recovery from I/O failure.
How to see what is set:
# cat /sys/block/*/device/scsi_device/*/device/timeout | uniq
30
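The command above collapses all devices into a single value. To see which device carries which timeout, a loop over sysfs can be used. This is a sketch: the function name and its optional sysfs-root argument are hypothetical conveniences, and it assumes each SCSI block device exposes /sys/block/<device>/device/timeout.

```shell
#!/bin/sh
# Sketch: print "<device> <timeout>" for every block device exposing a SCSI timeout.
# The optional first argument (hypothetical) overrides the sysfs root.
list_scsi_timeouts() {
    sysfs_root=${1:-/sys}
    for f in "$sysfs_root"/block/*/device/timeout; do
        [ -r "$f" ] || continue            # glob did not match, or file unreadable
        dev=${f#"$sysfs_root"/block/}      # strip the leading path
        dev=${dev%%/*}                     # keep only the device name
        echo "$dev $(cat "$f")"
    done
}

list_scsi_timeouts
```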
Limiting path failover time for SCSI devices
https://access.redhat.com/solutions/627903
From where does default 'scsi timeout' value get set for scsi devices ?
https://access.redhat.com/solutions/498733
REDHAT: I/O ERROR HANDLING FAILURES
The eh_deadline timer starts after the scsi_timeout expires, so the scsi_timeout contributes to the overall time the OS needs to declare the path as down.
The longer the eh_deadline timer is, the more chances the intermediate error handling steps have to retry their operations and commands.
An I/O command is sent to the block device, which has a 30 seconds max timeout.
If this expires, then the SCSI layer initiates the error handling steps and starts the eh_deadline countdown from 90 seconds (if defined).
All the intermediate steps/timers/retries start to happen.
Once the eh_deadline expires, the HBA port is reset which takes around 10 seconds.
As the HBA wakes up, the port may still be down, so a further 2 seconds is required to verify and a further 2 seconds to report the status.
DMP: SCSI INQUIRY PROBE
When the lower layers return the I/O to DMP on the original path, DMP will send a SCSI INQUIRY PROBE to validate the stability of the problematic path, whilst the I/O is retried against an alternate path.
DMP will take the decision to disable the problematic path based solely on the results of the SCSI INQUIRY PROBE.
The problematic path will remain in an enabled state during the SCSI probe event, because DMP will never disable the path until it gets a response from the SCSI probe.
In the meantime, DMP sets a specific flag on that path, which ensures that no further I/O is sent down the problematic path. The decision whether to disable the path is taken based on the output of the SCSI probe.
DMP: DMP_SCSI_TIMEOUT TUNABLE
The SCSI probe is sent via the DMP bypass framework. DMP will wait up to 20 seconds (by default) as defined by the DMP tunable "dmp_scsi_timeout" for the SCSI probe feedback.
# vxdmpadm gettune dmp_scsi_timeout
Tunable                        Current Value Default Value
------------------------------ ------------- -------------
dmp_scsi_timeout               20            20
The SCSI probe action informs DMP of the PATH status, thus no further retry is needed, unlike for I/Os.
- If the SCSI probe returns the path status as bad/timeout, DMP will fail the problematic path.
- If the probe action returns with success, DMP will continue to send future I/O via this path.
The result of the DMP SCSI INQUIRY PROBE determines what happens to the state of the problematic path.
REDHAT: LIMITING SCSI ERROR HANDLING TIME
RedHat Enterprise Linux 5.10.z and 6.4.z include several new SCSI and device driver parameters that can help to limit the overall time spent in SCSI error handling.
Setting these parameters to appropriate values may greatly reduce the time taken for the kernel to FAIL I/O on a defective path before DMP can re-issue I/O on other available device paths.
• eh_deadline (overall SCSI error handler deadline)
• Red Hat Enterprise Linux 5.10.z kernel kernel-2.6.18-371.6.1.el5 or newer
• Red Hat Enterprise Linux 6.4.z kernel kernel-2.6.32-358.32.3.el6 or newer
• SCSI devices in a multipath configuration
• Storage faults that cause IO to time out on external storage
REDHAT: OVERVIEW OF eh_deadline PARAMETER
• eh_deadline (overall SCSI error handler deadline)
It is initially set to 0, so the default behaviour of the system remains unchanged.
When this timeout expires, the HBA port is reset, and commands are free to retry on another path. The HBA port reset takes approximately 10 to 14 seconds.
• The following command defines the eh_deadline timeout (<VALUE>) in seconds for just the Fibre Channel SCSI host adapters within the system (execute from /etc/rc.d/rc.local):
# for i in $(ls /sys/class/fc_host/); do echo <VALUE> > /sys/class/scsi_host/$i/eh_deadline; done
• For all SCSI host adapters you can use:
# for i in $(ls /sys/class/scsi_host/*/eh_deadline); do echo <VALUE> > $i ; done
REDHAT: IMPLEMENTING THE eh_deadline PARAMETER
Although all SCSI drivers will list eh_deadline as an attribute, not all drivers currently support changing the eh_deadline value.
For drivers that don't support changing the eh_deadline value a message of "echo: write error: Invalid argument" will be displayed.
Veritas recommends engaging RedHat support prior to implementing the eh_deadline attribute.
Do not set the eh_deadline value lower than 70 seconds.
IMPORTANT: Please test such settings extensively before applying them on production systems.
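The guidance above (a 70 second minimum, and drivers that may reject the write) can be combined into one guarded helper. This is a minimal sketch, not a supported tool: set_eh_deadline and its optional sysfs-root argument are hypothetical names, and the function only assumes the standard /sys/class/scsi_host/<host>/eh_deadline attribute.

```shell
#!/bin/sh
# Sketch: write an eh_deadline value to every SCSI host, refusing values below 70.
# Usage: set_eh_deadline <seconds> [sysfs_root]   (sysfs_root is a hypothetical override)
set_eh_deadline() {
    value=$1
    sysfs_root=${2:-/sys}
    case $value in
        ''|*[!0-9]*)
            echo "eh_deadline must be a whole number of seconds" >&2
            return 2 ;;
    esac
    if [ "$value" -lt 70 ]; then
        echo "refusing eh_deadline=$value: below the 70 second minimum" >&2
        return 1
    fi
    for f in "$sysfs_root"/class/scsi_host/*/eh_deadline; do
        [ -w "$f" ] || continue
        # Drivers that do not support changing eh_deadline reject the write.
        if echo "$value" > "$f" 2>/dev/null; then
            echo "set $f to $value"
        else
            echo "driver rejected eh_deadline on $f" >&2
        fi
    done
}
```

For example, running set_eh_deadline 90 as root would attempt to apply 90 seconds to every SCSI host, while set_eh_deadline 60 is refused before anything is written.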
In the latest releases of RedHat, for example RHEL 7.9, the file "/sys/module/scsi_mod/parameters/eh_deadline" exists.
# cat /sys/module/scsi_mod/parameters/eh_deadline
-1
The "-1" value states the eh_deadline feature is disabled.
Veritas recommends setting the eh_deadline value to "90" seconds for all SCSI HBAs:
# echo 90 > /sys/module/scsi_mod/parameters/eh_deadline
This will set a default value of 90 seconds for all SCSI HBAs.
REDHAT eh_deadline TIMEOUT
When the I/O is received by the lower layers, the scsi_timeout starts. With eh_deadline implemented as the overall SCSI error handler timeout, the I/O service time becomes:
Device timeout (30 seconds by default)
eh_deadline set to 90 seconds
HBA Reset (14 seconds by default)
Total I/O service timeout of 134 seconds (includes 30 + 90 + 14 seconds HBA reset)
REDHAT: SCSI eh_timeout PARAMETER
The eh_timeout parameter was introduced in the later RHEL 6.4 kernels and is present in RHEL 6.5 kernels onward.
eh_timeout (TEST UNIT READY error handler timeout)
The number of seconds to wait for TUR operations issued by the error handling code to respond.
# cat /sys/class/scsi_device/<h:c:t:l>/device/eh_timeout
The default is 10 seconds.
This timer may run a minimum of 4 times in the case where the target has gone silent. This could be safely reduced to 5 seconds.
This will allow adequate time to complete several task management operations (as above), with the associated TEST UNIT READY operations, before the eh_deadline expires.
REDHAT: EMULEX lpfc_task_mgmt_tmo PARAMETER
lpfc_task_mgmt_tmo (LPFC task management timeout for ABORT TASK, LUN RESET, TARGET RESET)
The number of seconds to wait for task management operations (ABORT TASK, LUN RESET, TARGET RESET) issued by the LPFC driver to respond.
# for i in $(ls /sys/class/scsi_host/*/lpfc_task_mgmt_tmo); do echo <VALUE> > $i ; done
The default is 60 seconds. This timer may run a minimum of 4 times, in the case where the target has gone silent.
This timeout could be safely reduced to 20 seconds.
This will allow adequate time to attempt a reset, followed by a TEST UNIT READY, and to retry the process if the failure persists, before the eh_deadline expires.
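To illustrate why the reduction matters, the minimum cumulative time for the task management timer can be compared with the eh_deadline. This is a rough sketch: it assumes the timer runs exactly 4 times back to back and that eh_deadline is set to 90 seconds. The function name is hypothetical.

```shell
#!/bin/sh
# Hypothetical helper: minimum cumulative time if the timer runs "runs" times (default 4).
min_tmo_total() {
    tmo=$1
    runs=${2:-4}
    echo $(( tmo * runs ))
}

eh_deadline=90   # assumed value, in seconds
for tmo in 60 20; do
    total=$(min_tmo_total "$tmo")
    if [ "$total" -lt "$eh_deadline" ]; then
        echo "lpfc_task_mgmt_tmo=$tmo: ${total}s fits within eh_deadline=${eh_deadline}s"
    else
        echo "lpfc_task_mgmt_tmo=$tmo: ${total}s exceeds eh_deadline=${eh_deadline}s"
    fi
done
```

Under these assumptions the default of 60 seconds needs at least 240 seconds, whereas 20 seconds needs 80 seconds, which completes before a 90 second eh_deadline expires.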
DMP: MESSAGE Reached DMP Threshold IO Timeout
In the /etc/vx/dmpevents.log you may see messages such as:
Jan 17 20:51:51 BARNEY kernel: VxVM vxdmp V-5-3-0 Reached DMP Threshold IO TimeOut (100 secs) I/O with start 50cdf3a309b18 and end 50cdf4237868e time
The buffer start time means the time when the buffer enters DMP.
The buffer end time means the time when the SCSI layer returns the I/O to DMP.
# echo 50cdf3a309b18 | perl -ne 'print localtime(hex($_)/1e6)."\n"'
Sat Jan 17 20:49:37 2015
# echo 50cdf4237868e | perl -ne 'print localtime(hex($_)/1e6)."\n"'
Sat Jan 17 20:51:51 2015
The time difference between the two values is 134 seconds, which exceeds the DMP timebound iotimeout value of 100 seconds.
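The elapsed time can also be computed directly from the two hexadecimal values, without converting each one to a date first. This sketch assumes the values are microsecond timestamps, as implied by the division by 1e6 in the perl commands. The function name is hypothetical.

```shell
#!/bin/sh
# Hypothetical helper: whole seconds between two hexadecimal microsecond timestamps.
dmp_elapsed_seconds() {
    start_hex=$1
    end_hex=$2
    echo $(( (0x$end_hex - 0x$start_hex) / 1000000 ))
}

dmp_elapsed_seconds 50cdf3a309b18 50cdf4237868e   # prints 134
```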
DMP FIXEDRETRY vs TIMEBOUND
The lower layers must still respond back to DMP, for DMP to react accordingly.
FixedRetry
Pro: In certain conditions, DMP will try the failed I/O at least 5 times (by default) against the DMPNODE (not 5 times per path).
Con: DMP will try the I/O on a single path and then proceed to retry the I/O on other available paths until the retry count is exhausted for a particular I/O.
If the I/O takes a long time to complete at the lower layers, then the upper layers may have to wait an undetermined amount of time before the lower layers return the I/O. This can delay the detection of a failed LUN/path.
Timebound
Pro: In certain conditions, DMP will continue to retry the failed IO until the defined iotimeout value is reached, for example 300 seconds (by default).
Con: If the I/O takes longer (more than 300 seconds by default) than the defined iotimeout value to reply back, DMP will fail the I/O without any further retries against an alternate path.
Overview of attribute settings
Solution
Please contact Veritas and RedHat support for further advice if required.
Do not set the DMP timebound iotimeout lower than 180 seconds across Linux environments.
Veritas recommends engaging RedHat support prior to implementing the eh_deadline attribute.
Do not set the eh_deadline value lower than 70 seconds.
IMPORTANT: Please test such settings extensively before applying them on production systems.
Applies To
RedHat 6.x