Problem
After upgrading ONTAP to 9.8 or later, high sustained latency is experienced when a connection is made by the FPolicy server.
Error Message
EMS or the event log may report EAGAIN errors similar to the following:
Cluster::> event log show -event *fpolicy*
[filer1: fpolicy: fpolicy.eagain.on.write:notice]: Write returned EAGAIN while sending notification to the FPolicy server "12.34.56.78" for vserver ID 3.
The fpolicycmod.log from the Data Insight Collector node may show the following:
2022-01-02 20:42:05 INFO: V-378-1339-2023: #{7296} [stat_calculate: 924] Filer nasfiler: instance nas01:kernel:nasfiler: Avg cifs_lat 137 microsecs, cifs_wrlat 169 microsecs, cifs_rdlat 65 microsecs. Avg nfsv3_lat 0.0 microsecs, nfsv3_wrlat 0.0 microsecs, nfsv3_rdlat 0.0 microsecs. Avg calculated over last 180 secs.
2022-01-02 20:42:05 INFO: V-378-1339-2023: #{7296} [stat_calculate: 924] Filer nasfiler: instance nas02:kernel:nasfiler: Avg cifs_lat 517 microsecs, cifs_wrlat 0 microsecs, cifs_rdlat 0 microsecs. Avg nfsv3_lat 0.0 microsecs, nfsv3_wrlat 0.0 microsecs, nfsv3_rdlat 0.0 microsecs. Avg calculated over last 180 secs.
2022-01-02 20:42:05 ERROR: V-378-1339-4104: #{7296} [set_data_vserv_hash: 2156] Failed to get volumes info for VServer[nasfiler]
2022-01-02 20:42:05 ERROR: V-378-1339-4077: #{7296} [process_vservers: 2898] Failed to add VServer nasfiler to allowed hash.
2022-01-02 20:42:05 ERROR: V-378-1339-4093: #{7296} [process_vservers: 2972] Last connection try to CIFS Server[NASFILER], VServer[nasfiler] for Cluster[nascluster] failed. Connection failure reason may include firewall settings blocking connection from this CIFS Server. Trying again to invoke a connection..
2022-01-02 20:42:05 INFO: V-378-1339-2023: #{7296} [stat_calculate: 924] Filer nasnmtest: instance nas01:kernel:nasnmtest: Avg cifs_lat 161 microsecs, cifs_wrlat 157 microsecs, cifs_rdlat 334 microsecs. Avg nfsv3_lat 0.0 microsecs, nfsv3_wrlat 0.0 microsecs, nfsv3_rdlat 0.0 microsecs. Avg calculated over last 180 secs.
2022-01-02 20:42:05 INFO: V-378-1339-2023: #{7296} [stat_calculate: 924] Filer nasnmtest: instance nas02:kernel:nasnmtest: Avg cifs_lat 0 microsecs, cifs_wrlat 0 microsecs, cifs_rdlat 0 microsecs. Avg nfsv3_lat 0.0 microsecs, nfsv3_wrlat 0.0 microsecs, nfsv3_rdlat 0.0 microsecs. Avg calculated over last 180 secs.
Cause
EAGAIN is reported when the FPolicy Service Manager (FSM) is unable to read from or write to the local receive or send buffer. Asynchronous FPolicy sends screening messages to the FPolicy server out of band. If the FPolicy server responds more slowly than screening requests arrive, the initial requests are let through, and any requests remaining in the queue wait for a response from the FSM. The FSM, in turn, is waiting for a response from the FPolicy server because the FPolicy buffer is full. This condition is logged as EAGAIN in the FPolicy log.
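As a purely illustrative analogy (this is not ONTAP or FSM code), the following Python sketch shows the generic non-blocking socket behaviour behind EAGAIN: once the receiver stops draining data and the sender's buffer fills, a write fails with EAGAIN and the sender must queue the notification and retry later.

# Illustrative only: generic non-blocking socket behaviour behind EAGAIN.
# The "server" end never reads, so the "client" send buffer eventually fills
# and send() raises BlockingIOError (errno EAGAIN / EWOULDBLOCK).
import errno
import socket

server, client = socket.socketpair()          # stand-ins for the two endpoints
client.setblocking(False)                     # asynchronous sends: do not block
client.setsockopt(socket.SOL_SOCKET, socket.SO_SNDBUF, 4096)  # shrink buffer (best effort)

notification = b"x" * 1024                    # one screening notification
pending = []                                  # notifications waiting for buffer space

for seq in range(1000):
    if pending:                               # buffer already full: keep queueing
        pending.append(notification)
        continue
    try:
        client.send(notification)             # fast path: buffer has room
    except BlockingIOError as exc:            # EAGAIN: the peer is not keeping up
        assert exc.errno in (errno.EAGAIN, errno.EWOULDBLOCK)
        print(f"EAGAIN at notification {seq}: send buffer full, queueing")
        pending.append(notification)

print(f"{len(pending)} notifications queued while waiting for the peer to drain")

In the same way, when the FPolicy server falls behind, the FSM's writes fail with EAGAIN and notifications back up until the server drains the queue.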
Solution
To address the above issue, Veritas and NetApp have each identified a solution.
Veritas has released a hotfix that makes the handling of ignored or unutilized notifications more robust. This hotfix is highly recommended for customers who are planning a NetApp upgrade or are already running NetApp ONTAP 9.7P14 or higher; however, the Veritas Data Insight hotfix may be applied regardless of the NetApp version in the environment.
Follow one or both of the following solution steps to address the issue:
1. Apply the Veritas Data Insight hotfix for your release:
Data Insight 6.1RP5 (6.1.5): data_insight-6.1RP5_HF3 (veritas.com)
Data Insight 6.2: 6.2_HF2 (veritas.com)
Data Insight 6.3RP1 (6.3.1): 6.3RP1_HF1 (veritas.com)
2. Contact NetApp for details on patching ONTAP.
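After applying either fix, the EMS query shown above can be rerun to confirm that fpolicy.eagain.on.write events are no longer being generated:
Cluster::> event log show -event *fpolicy*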
Related Documentation:
High sustained latency after ONTAP upgrade to 9.8 or 9.8P4 due to FPolicy - NetApp Knowledge Base
Fpolicy EAGAIN errors seen in fpolicy.log in ONTAP - NetApp Knowledge Base