Access 3340 appliance version 7.4.2 can suffer a hang or unexpected reboot of one node when using Veritas Data Deduplication (VDD)
Problem
When using the VDD solution with an Access 3340 appliance at version 7.4.2, a hang or unexpected reboot (panic) of the node running the VDD services can occur.
In the case of a hang event, the VDD service does not fail over to the remaining node.
Error Message
There is no error message.
Cause
Processes associated with the VDD service, spoold and spad, consume large amounts of memory. Depending on the workload and the physical RAM installed, the memory usage of these processes can become so large that other processes find it increasingly difficult to allocate the memory they require, causing unexpected behaviour of the node. Eventually, a hang or panic can occur.
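For reference, the resident memory of these processes can be checked from the node's OS shell using standard Linux ps options (a quick check, not an appliance-specific command):
ps -C spoold,spad -o pid,comm,rss,%mem --sort=-rss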
Solution
Patch 7.4.2.100 for the Access 3340 has been released, containing fixes for the VDD service and improved tuning of memory management. These fixes and tuning improve stability and decrease the likelihood of a hang or panic.
However, depending on the workload, the VDD processes may still require large amounts of memory. If the Access 3340 appliance is intended for use with the VDD solution, at least 380GB of RAM per node is recommended.
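The installed RAM on each node can be confirmed from the OS shell with the standard Linux free command (not an appliance-specific tool):
free -g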
For appliances at 7.4.2 it is recommended to apply the 7.4.2.100 patch. If that is not possible, the following tuning can be applied to improve stability. Note that this requires the VDD service to be stopped.
1. Tune vm.swappiness from zero to ten. Note that these actions need to be carried out on all nodes in the cluster:
Check that the current value is zero, adjust it to ten, and confirm the new value:
sysctl vm.swappiness
sysctl vm.swappiness=10
sysctl vm.swappiness
Adjust the boot script so the new value persists across reboots (a scripted variant covering all nodes is sketched after this step):
cp /opt/VRTSnas/scripts/misc/nas_always.sh /opt/VRTSnas/scripts/misc/nas_always.orig
vi /opt/VRTSnas/scripts/misc/nas_always.sh
Alter line 158 to set the swappiness value to 10
158 echo 10 > /proc/sys/vm/swappiness
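If preferred, the same change can be scripted from a single node. This is a minimal sketch only: it assumes passwordless SSH between the cluster nodes, uses placeholder node names (node_01, node_02), and assumes line 158 currently reads 'echo 0 > /proc/sys/vm/swappiness'; verify each assumption before use.
# Placeholder node names - replace with the actual cluster node names.
for node in node_01 node_02; do
  ssh "$node" "sysctl vm.swappiness=10 && \
    cp /opt/VRTSnas/scripts/misc/nas_always.sh /opt/VRTSnas/scripts/misc/nas_always.orig && \
    sed -i 's|echo 0 > /proc/sys/vm/swappiness|echo 10 > /proc/sys/vm/swappiness|' \
      /opt/VRTSnas/scripts/misc/nas_always.sh"
done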
2. Tune spoold MaxCacheSize down from the default of 75%.
If the 3340 cluster nodes each have 350GB of RAM or more, use a value of 25%.
If the 3340 cluster nodes each have less than 350GB of RAM, use a value of 50%.
Note: this action only needs to be carried out on any one node.
Stop the VDD service using the Access Cluster CLISH:
CLISH> dedupe show
CLISH> dedupe stop
Adjust the VDD config file to set the new value
cp /vx/<FILESYSTEM>/dedupe/etc/puredisk/contentrouter.cfg /vx/<FILESYSTEM>/dedupe/etc/puredisk/contentrouter.cfg.orig
vi /vx/<FILESYSTEM>/dedupe/etc/puredisk/contentrouter.cfg
Alter line 401 to set MaxCacheSize from 75% to either 50% or 25%
MaxCacheSize=50%
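Alternatively, the change can be made non-interactively. This sketch assumes line 401 currently reads exactly MaxCacheSize=75%; confirm with grep before and after the edit:
grep -n MaxCacheSize /vx/<FILESYSTEM>/dedupe/etc/puredisk/contentrouter.cfg
sed -i 's/^MaxCacheSize=75%/MaxCacheSize=50%/' /vx/<FILESYSTEM>/dedupe/etc/puredisk/contentrouter.cfg
grep -n MaxCacheSize /vx/<FILESYSTEM>/dedupe/etc/puredisk/contentrouter.cfg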
Restart the VDD service
CLISH> dedupe start
CLISH> dedupe show
After making the tunable changes, monitor the memory usage of the spoold process. If it reaches 80% of RAM or more, restart the VDD service to ensure stability of the appliance.
In the appliance CLISH, go to the Monitor section and use the Top option:
CLISH.Main_Menu> Monitor
CLISH.Monitor> Top
When top is running, press <shift>M to sort the output on memory usage.
Look for the spoold process and make a note of the value in the %MEM column.
Press 'q' to exit from the top screen and return to the CLISH prompt.
If spoold is using around 80% of RAM or more, restart the VDD service:
CLISH> dedupe stop
CLISH> dedupe start
CLISH> dedupe show
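Alternatively, the same check can be scripted from the node's OS shell. This is a minimal sketch using standard Linux tools, not an appliance-provided command; the 80% threshold is the one given above:
#!/bin/sh
# Sum the %MEM of all spoold processes and warn at the 80% threshold.
MEM=$(ps -C spoold -o %mem= | awk '{s+=$1} END {printf "%d", s}')
if [ "${MEM:-0}" -ge 80 ]; then
  echo "spoold is using ${MEM}% of RAM - restart the VDD service (dedupe stop/start)"
fi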
Reconfiguring VDD
If VDD is unconfigured and then configured again, the MaxCacheSize tunable will be reset to the default of 75%. After configuring dedupe, repeat step 2 above to set MaxCacheSize.
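To check whether the tunable has been reset before repeating step 2, the current value can be read with grep (path as in step 2):
grep MaxCacheSize /vx/<FILESYSTEM>/dedupe/etc/puredisk/contentrouter.cfg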
EEB Considerations
It is highly recommended to apply patch 7.4.2.100. However, if applying the patch is not possible, the following EEBs are available for 7.4.2. The EEBs marked as public are available from the SORT website; please contact Veritas Support for the others.
EEB | Version | Description
3972425 | 3 | Disable selective email notifications
3971871 | 4 | The EEB bundle contains fixes for VDD issues on Access 7.4.2
3971580 | 1 | Fix Access Appliance memory corruption issue (public EEB)
3964974 | 6 | Fix reboot/shutdown issue in cluster nodes (public EEB)