High sustained latency after ONTAP upgrade to 9.7 P14 and above due to FPolicy EAGAIN

Article: 100052443
Last Published: 2023-08-30
Product(s): Data Insight

Problem

After upgrading ONTAP to 9.7P14 or later, high sustained latency is experienced when the FPolicy server makes a connection.

Error Message

The EMS or event log reports EAGAIN errors similar to the following:

Cluster::> event log show -event *fpolicy*

[filer1: fpolicy: fpolicy.eagain.on.write:notice]: Write returned EAGAIN while sending notification to the FPolicy server "12.34.56.78" for vserver ID 3.
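
When many of these events are logged, it can help to extract the affected FPolicy server and vserver ID from saved event log output. The following is a minimal sketch in Python. The regular expression is modeled only on the sample line above, and the function name `parse_eagain_events` is hypothetical, so the pattern may need adjusting for other ONTAP releases.

```python
import re

# Pattern modeled on the sample EMS line in this article; other ONTAP
# releases may format the message differently (assumption).
EAGAIN_RE = re.compile(
    r'fpolicy\.eagain\.on\.write.*?FPolicy server "(?P<server>[^"]+)" '
    r'for vserver ID (?P<vserver_id>\d+)'
)

def parse_eagain_events(lines):
    """Return (server, vserver_id) tuples for each EAGAIN write event found."""
    events = []
    for line in lines:
        match = EAGAIN_RE.search(line)
        if match:
            events.append((match.group("server"), int(match.group("vserver_id"))))
    return events

sample = [
    '[filer1: fpolicy: fpolicy.eagain.on.write:notice]: Write returned EAGAIN '
    'while sending notification to the FPolicy server "12.34.56.78" for vserver ID 3.',
]
print(parse_eagain_events(sample))
```

Grouping the returned tuples by server address shows whether the errors are concentrated on a single Collector node.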

The fpolicycmod.log from the Data Insight Collector node may show the following:

2022-01-02 20:42:05 INFO: V-378-1339-2023: #{7296} [stat_calculate: 924] Filer nasfiler: instance nas01:kernel:nasfiler: Avg cifs_lat 137 microsecs, cifs_wrlat 169 microsecs, cifs_rdlat 65 microsecs. Avg nfsv3_lat 0.0 microsecs, nfsv3_wrlat 0.0 microsecs, nfsv3_rdlat 0.0 microsecs. Avg calculated over last 180 secs.
2022-01-02 20:42:05 INFO: V-378-1339-2023: #{7296} [stat_calculate: 924] Filer nasfiler: instance nas02:kernel:nasfiler: Avg cifs_lat 517 microsecs, cifs_wrlat 0 microsecs, cifs_rdlat 0 microsecs. Avg nfsv3_lat 0.0 microsecs, nfsv3_wrlat 0.0 microsecs, nfsv3_rdlat 0.0 microsecs. Avg calculated over last 180 secs.
2022-01-02 20:42:05 ERROR: V-378-1339-4104: #{7296} [set_data_vserv_hash: 2156] Failed to get volumes info for VServer[nasfiler]
2022-01-02 20:42:05 ERROR: V-378-1339-4077: #{7296} [process_vservers: 2898] Failed to add VServer nasfiler to allowed hash.
2022-01-02 20:42:05 ERROR: V-378-1339-4093: #{7296} [process_vservers: 2972] Last connection try to CIFS Server[NASFILER], VServer[nasfiler] for Cluster[nascluster] failed. Connection failure reason may include firewall settings blocking connection from this CIFS Server. Trying again to invoke a connection..
2022-01-02 20:42:05 INFO: V-378-1339-2023: #{7296} [stat_calculate: 924] Filer nasnmtest: instance nas01:kernel:nasnmtest: Avg cifs_lat 161 microsecs, cifs_wrlat 157 microsecs, cifs_rdlat 334 microsecs. Avg nfsv3_lat 0.0 microsecs, nfsv3_wrlat 0.0 microsecs, nfsv3_rdlat 0.0 microsecs. Avg calculated over last 180 secs.
2022-01-02 20:42:05 INFO: V-378-1339-2023: #{7296} [stat_calculate: 924] Filer nasnmtest: instance nas02:kernel:nasnmtest: Avg cifs_lat 0 microsecs, cifs_wrlat 0 microsecs, cifs_rdlat 0 microsecs. Avg nfsv3_lat 0.0 microsecs, nfsv3_wrlat 0.0 microsecs, nfsv3_rdlat 0.0 microsecs. Avg calculated over last 180 secs.
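
To review a large fpolicycmod.log quickly, the average CIFS latency per instance and the error codes can be pulled out with a short script. This is a sketch that assumes the log layout shown in the sample above. The function name `summarize_log` and both patterns are illustrative and are not part of Data Insight.

```python
import re

# Patterns modeled on the sample fpolicycmod.log lines in this article;
# other Data Insight releases may log a different layout (assumption).
LATENCY_RE = re.compile(
    r"instance (?P<instance>\S+): Avg cifs_lat (?P<cifs_lat>[\d.]+) microsecs"
)
ERROR_RE = re.compile(r"ERROR: (?P<code>V-\d+-\d+-\d+):")

def summarize_log(lines):
    """Return ({instance: avg cifs_lat in microsecs}, [error codes])."""
    latencies = {}
    errors = []
    for line in lines:
        lat = LATENCY_RE.search(line)
        if lat:
            # Later lines overwrite earlier ones, so the newest average wins.
            latencies[lat.group("instance")] = float(lat.group("cifs_lat"))
        err = ERROR_RE.search(line)
        if err:
            errors.append(err.group("code"))
    return latencies, errors

sample_log = [
    "2022-01-02 20:42:05 INFO: V-378-1339-2023: #{7296} [stat_calculate: 924] "
    "Filer nasfiler: instance nas02:kernel:nasfiler: Avg cifs_lat 517 microsecs, "
    "cifs_wrlat 0 microsecs, cifs_rdlat 0 microsecs.",
    "2022-01-02 20:42:05 ERROR: V-378-1339-4104: #{7296} [set_data_vserv_hash: 2156] "
    "Failed to get volumes info for VServer[nasfiler]",
]
latencies, errors = summarize_log(sample_log)
print(latencies, errors)
```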

Cause

EAGAIN is returned when the FPolicy Service Manager (FSM) is unable to read from or write to the local receive or send buffer.

Asynchronous FPolicy sends screening messages to the FPolicy server out of band.

If the FPolicy server responds more slowly than screening requests arrive, the initial requests are let through.

Any requests still in the queue wait for a response from the FSM.

In turn, the FSM waits for a response from the FPolicy server because the FPolicy buffer is full.

This condition is logged as EAGAIN in the FPolicy log.
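
The general mechanism can be reproduced outside ONTAP with any non-blocking socket whose peer stops reading. The sketch below is an analogy only and does not represent the FSM implementation. It uses a local socket pair, and the function name `fill_send_buffer` and its parameters are hypothetical. The sender writes until the buffers fill, at which point the operating system reports EAGAIN (surfaced in Python as `BlockingIOError`).

```python
import errno
import socket

def fill_send_buffer(chunk_size=4096, max_writes=100_000):
    """Write to a non-blocking socket whose peer never reads, until EAGAIN.

    Returns (bytes accepted before the error, errno of the error or None).
    """
    sender, receiver = socket.socketpair()
    sender.setblocking(False)
    sent_total = 0
    try:
        for _ in range(max_writes):
            try:
                sent_total += sender.send(b"x" * chunk_size)
            except BlockingIOError as exc:
                # The buffers are full because the receiver is not draining them.
                return sent_total, exc.errno
        return sent_total, None
    finally:
        sender.close()
        receiver.close()

sent, err = fill_send_buffer()
print(f"buffered {sent} bytes before errno={err} (EAGAIN={errno.EAGAIN})")
```

The number of bytes accepted before the error depends on the socket buffer sizes of the host, which is why a slow receiver combined with a small buffer produces the error sooner.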

Solution

To address this issue, Veritas and NetApp have each identified a solution.

Veritas has released a hotfix that makes the handling of ignored or unutilized notifications more robust.

This hotfix is highly recommended for customers who are planning a NetApp upgrade or who already run NetApp ONTAP 9.7P14 or later.

However, customers may apply the Veritas Data Insight hotfix regardless of the NetApp version in their environment.

Follow one or both of the following solution steps to address the issue:

1. Apply the Veritas Data Insight hotfix that matches the installed version:

Data Insight 6.1RP5 (6.1.5): data_insight-6.1RP5_HF3 (veritas.com)

Data Insight 6.2: 6.2_HF2 (veritas.com)

Data Insight 6.3RP1 (6.3.1): 6.3RP1_HF1 (veritas.com)

2. Contact NetApp for details on patching ONTAP.

Related Documentation:

High sustained latency after ONTAP upgrade to 9.8 or 9.8P4 due to FPolicy - NetApp Knowledge Base

Fpolicy EAGAIN errors seen in fpolicy.log in ONTAP - NetApp Knowledge Base

Setting Receive-buffer in Data Insight to Align with NetApp Filer Send-buffer on High Sustained Latency Relating to EAGAIN error (veritas.com)
