Guidelines for troubleshooting VxFS file system hang conditions

Article: 100011219
Last Published: 2014-03-27
Product(s): InfoScale & Storage Foundation

Problem

When a file system hang condition occurs, applications accessing that file system may become unresponsive and, depending on the underlying issue, the entire system may hang. In most cases, the operating system blocks the process threads using the file system until the I/O has been sent to the disk subsystem and the underlying block device has acknowledged it to the file system.
 
The hang or slowness in I/O operations is not necessarily caused by the file system itself. The problem could lie anywhere in the I/O stack: the storage subsystem, SAN, HBA, SCSI layer, device multipathing software, block device, file system, or even the application itself.
 
To identify the actual problem, each component of the I/O stack should be investigated while the issue is occurring. This article, however, only provides guidelines for troubleshooting and investigating the Veritas File System (VxFS) component of the I/O stack. Other components may also require troubleshooting, but covering them is beyond the scope of this document.

Solution

To isolate the I/O hang and determine whether the issue is related to the Veritas File System (VxFS), the following steps are recommended:
 
1) Determine whether the file system is accessible or is completely hung for all commands and processes
 
# ls -l <problem file system mount point>
 
# cd <problem file system mount point>

 
# umount <problem file system mount point>
 
If a command hangs, do not kill it. Leave the process running for further troubleshooting.
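
One way to run these checks without tying up the terminal is to start each command in the background and test whether it finished within a few seconds, leaving it running if it did not. The following is a minimal sketch and not part of any Veritas tooling; the function name probe_cmd, the wait time, and the /mnt/data mount point are hypothetical.

```shell
# Hypothetical helper: run a command in the background and report whether it
# completed within WAIT seconds. A command that has not completed is left
# running (never killed) so it can be examined in the later steps.
probe_cmd() {
    wait_secs=$1
    shift
    flag=$(mktemp) || return 1
    # The subshell writes to the flag file only after the command returns.
    ( "$@"; echo done > "$flag" ) > /dev/null 2>&1 &
    pid=$!   # PID of the wrapper subshell; the command itself is its child
    sleep "$wait_secs"
    if [ -s "$flag" ]; then
        rm -f "$flag"
        echo "completed"
    else
        echo "still running after ${wait_secs}s (pid $pid)"
    fi
}

# Example (assumed mount point):
#   probe_cmd 10 ls -l /mnt/data
```

A "still running" result only suggests a hang; the wait time is arbitrary, and a slow but healthy file system can exceed it.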
 
2) Capture the current system process table
 
Solaris:
# /usr/ucb/ps -auxwww
# /usr/bin/ps -ealf

 
Linux:
# /bin/ps -ealf
 
AIX:
# /usr/bin/ps -ealf
 
HP-UX:
# /usr/bin/ps -ealf
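
The per-platform commands above can be wrapped so that the same script works on every node. This is a sketch under assumptions: the function name capture_ps and the output file names are hypothetical, and the output directory should be on a file system other than the one being investigated.

```shell
# Hypothetical helper: save the process table to timestamped files in OUTDIR,
# using the per-platform commands listed above.
capture_ps() {
    outdir=$1
    ts=$(date +%Y%m%d-%H%M%S)
    mkdir -p "$outdir" || return 1
    case "$(uname -s)" in
        SunOS)
            /usr/ucb/ps -auxwww > "$outdir/ps-auxwww.$ts.out"
            /usr/bin/ps -ealf   > "$outdir/ps-ealf.$ts.out"
            ;;
        *)
            # Linux, AIX and HP-UX all use ps -ealf in the list above.
            ps -ealf > "$outdir/ps-ealf.$ts.out"
            ;;
    esac
}

# Example (assumed output directory):
#   capture_ps /var/tmp/vxfs-hang
```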
 
 
3) Identify the processes currently using the file system
 
# fuser -cu <problem file system mount point>
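
For scripting the later steps, it can help to reduce the fuser output to a plain list of PIDs. The sketch below assumes that fuser writes the bare PIDs to standard output and the usage codes and user names to standard error, as its man pages describe; verify this on your platform. The function name pids_on_mount and the FUSER override variable are hypothetical, and the -u option is omitted because only the PIDs are kept.

```shell
# Hypothetical helper: print one PID per line for processes using MOUNT.
# Assumes bare PIDs go to stdout and everything else goes to stderr.
FUSER=${FUSER:-fuser}
pids_on_mount() {
    "$FUSER" -c "$1" 2>/dev/null | tr -s ' \t' '\n\n' | grep -E '^[0-9]+$'
}

# Example (assumed mount point):
#   pids_on_mount /mnt/data
```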
 
4) Collect a stack trace of each process using the file system every 60 seconds, three times.
 
Solaris:
# pstack <pid of process(es) from step 3>
 
Linux:
# gdb -p <pid of process(es) from step 3>
 
At the (gdb) prompt, run the "bt" command to capture the stack of the process, then run "quit" to exit gdb.
 
AIX:
# procstack <pid of process(es) from step 3>
 
HP-UX:
# pstack <pid of process(es) from step 3>
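
The three samples at 60-second intervals can be collected with a small loop. This is a minimal sketch; the function name repeat_capture, the PID 1234, and the output paths are placeholders. On Linux, the example uses gdb's -batch and -ex options to run "bt" non-interactively instead of typing it at the prompt.

```shell
# Hypothetical helper: run a command COUNT times, INTERVAL seconds apart,
# appending each run's output to OUTFILE under a timestamp header.
repeat_capture() {
    count=$1 interval=$2 outfile=$3
    shift 3
    i=1
    while [ "$i" -le "$count" ]; do
        echo "=== sample $i of $count at $(date) ===" >> "$outfile"
        "$@" >> "$outfile" 2>&1
        # No need to wait after the final sample.
        [ "$i" -lt "$count" ] && sleep "$interval"
        i=$((i + 1))
    done
}

# Examples (1234 is a placeholder PID; output paths are assumptions):
#   Solaris / HP-UX: repeat_capture 3 60 /var/tmp/stack.1234.out pstack 1234
#   AIX:             repeat_capture 3 60 /var/tmp/stack.1234.out procstack 1234
#   Linux:           repeat_capture 3 60 /var/tmp/stack.1234.out \
#                        gdb -p 1234 -batch -ex bt
```

If the stack command itself blocks on the hung process, the loop stalls at that sample, which is worth recording as evidence.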
 
 
5) Trace a process that demonstrates the hang on the file system
 
Solaris:
# /usr/bin/truss  -a  -e  -d -D -l -f  -o <output file name> -rall  -wall <command against file system>
and/or
# /usr/bin/truss  -a  -e  -d -D -l -f  -o <output file name> -rall  -wall -p <pid>
 
Linux:
# strace -fFrTo <output file name>  <command against file system>
and/or
# strace -fFrTo <output file name>  -p <pid>
 
AIX:
# truss -defl -o <output file name> <command>
and/or
# truss -defl -o <output file name> -p <pid>

HP-UX:
# /usr/local/bin/tusc -aDeEfFlvTu -o <output file name> <command>
and/or
# /usr/local/bin/tusc -aDeEfFlvTu -o <output file name> <pid>

6) Collect the kernel thread list

# /opt/VRTSspt/FS/Cfstlist/cfstlist -c 5 -i 30 -o tlist_all.txt
 
The cfstlist command is available in the VRTSspt6.0 package in the 6.0 release and later. You can also use this package for capturing the thread list on earlier VxFS versions.
 
For cfstlist to work without password prompts, you must configure ssh or rsh between the nodes. It also depends on the operating system's native kernel debuggers, which must be installed and configured properly. In particular, on Linux the kernel debugger "crash" requires the kernel-debuginfo packages. As a best practice, manually verify that the native debugger works on each node and that the nodes can communicate with each other through passwordless rsh or ssh.
 
If the nodes cannot be configured to communicate, gather the thread list on each node separately using 'cfsnodetlist'. Try to collect these thread lists as close to each other in time as possible.
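
The passwordless ssh prerequisite can be verified before running cfstlist. The sketch below is an assumption-laden example and not part of the VRTSspt package: the function name check_passwordless_ssh, the SSH override variable, and the node names are hypothetical. It relies on the OpenSSH BatchMode option, which makes ssh fail rather than prompt for a password.

```shell
# Hypothetical pre-check: report which nodes accept non-interactive ssh.
# BatchMode=yes makes ssh fail instead of prompting for a password.
SSH=${SSH:-ssh}
check_passwordless_ssh() {
    rc=0
    for node in "$@"; do
        if "$SSH" -o BatchMode=yes -o ConnectTimeout=5 "$node" true \
               < /dev/null > /dev/null 2>&1; then
            echo "$node: ok"
        else
            echo "$node: FAILED"
            rc=1
        fi
    done
    return "$rc"
}

# Example (node names are placeholders):
#   check_passwordless_ssh node1 node2
```

The function returns non-zero if any node fails, so it can gate a collection script.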
 
 
7) Collect vxfsstat output every 60 seconds, three times, for the file system that is hung.
 
# vxfsstat -w outfile [-c count] [-t seconds] mount_point  
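
As an illustration, the synopsis above can be filled in for three samples taken 60 seconds apart. This is a sketch only: the function name collect_vxfsstat, the VXFSSTAT override variable, the output file name, and the example paths are hypothetical, and the options are used exactly as shown in the synopsis.

```shell
# Hypothetical wrapper around the synopsis above: 3 samples, 60 seconds
# apart, written to a timestamped file in OUTDIR.
VXFSSTAT=${VXFSSTAT:-vxfsstat}
collect_vxfsstat() {
    mnt=$1 outdir=$2
    outfile="$outdir/vxfsstat.$(date +%Y%m%d-%H%M%S).out"
    "$VXFSSTAT" -w "$outfile" -c 3 -t 60 "$mnt"
}

# Example (assumed mount point and output directory):
#   collect_vxfsstat /mnt/data /var/tmp
```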

8) The information collected above helps our support teams isolate the issue and determine whether the problem is related to the VxFS file system or to a component outside the file system boundary. If the evidence indicates that the issue may be related to the VxFS file system, collect the following evidence in addition to the outputs requested above.
 
a) Collect VRTSexplorer output while the issue is occurring, along with the command log/history that led to the issue, so that the problem can be understood correctly. Follow this technical article to capture the Veritas explorer output: https://www.veritas.com/support/en_US/article.100038650

b) If you plan to reboot the server, first stop any applications that use other VxFS file systems. Then collect a system dump and reboot the server.
 
9) After collecting the requested evidence, contact Veritas Technical Support for further investigation.

 

Applies To

Any Storage Foundation environment involving VxFS file systems
