Arctera (Veritas) Volume Replicator (VVR) experiences pauses due to network disconnects triggered by non-Veritas zero-length UDP packets sent to port 4145
Problem
Arctera (Veritas) Volume Replicator (VVR) replication pauses because of network disconnects. The disconnects are triggered by zero-length UDP packets that do not originate from Arctera (Veritas) software and are sent to port 4145.
Error Message
On DR site:
Jul 21 04:51:52 DRsysA kernel: VxVM VVR vxio V-5-0-1404 Heartbeat unacknowledged from node #.#.#.# for 28 seconds
Jul 21 04:51:53 DRsysA kernel: VxVM VVR vxio V-5-0-1404 Heartbeat unacknowledged from node #.#.#.# for 29 seconds
Jul 21 04:51:53 DRsysA kernel: VxVM VVR vxio V-5-0-1404 Heartbeat unacknowledged from node #.#.#.# for 29 seconds
Jul 21 04:51:54 DRsysA kernel: VxVM VVR vxio V-5-0-1404 Heartbeat unacknowledged from node #.#.#.# for 30 seconds
Jul 21 04:51:54 DRsysA kernel: VxVM VVR vxio V-5-0-1404 Heartbeat unacknowledged from node #.#.#.# for 30 seconds
On PROD site:
Jul 21 04:51:47 prodsysA kernel: VxVM VVR vxio V-5-0-1404 Heartbeat unacknowledged from node #.#.#.# for 23 seconds
Jul 21 04:51:47 prodsysA kernel: VxVM VVR vxio V-5-0-1404 Heartbeat unacknowledged from node #.#.#.# for 23 seconds
Jul 21 04:51:48 prodsysA kernel: VxVM VVR vxio V-5-0-1404 Heartbeat unacknowledged from node #.#.#.# for 24 seconds
Jul 21 04:51:48 prodsysA kernel: VxVM VVR vxio V-5-0-1404 Heartbeat unacknowledged from node #.#.#.# for 24 seconds
The RLINKs were disconnected after the heartbeats had gone unacknowledged for an extended period (about 30 seconds in the logs above):
Jul 21 04:51:55 prodsysA kernel: VxVM VVR vxio V-5-0-1406 Node #.#.#.# disconnected from node #.#.#.#
Jul 21 04:51:55 prodsysA kernel: VxVM VVR vxio V-5-0-1406 Node #.#.#.# disconnected from node #.#.#.#
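When triaging a larger syslog, it can help to pull out only the two message IDs shown above. The following is a minimal sketch, not an Arctera (Veritas) tool: the function name summarise and both regular expressions are illustrative assumptions based solely on the log lines quoted in this article, and other VVR releases may format these messages differently.

```python
import re

# Patterns derived only from the sample log lines in this article (assumption).
HEARTBEAT_RE = re.compile(
    r"^(?P<ts>\w{3}\s+\d+\s[\d:]{8})\s(?P<host>\S+)\skernel:.*V-5-0-1404 "
    r"Heartbeat unacknowledged from node (?P<node>\S+) for (?P<secs>\d+) seconds"
)
DISCONNECT_RE = re.compile(
    r"^(?P<ts>\w{3}\s+\d+\s[\d:]{8})\s(?P<host>\S+)\skernel:.*V-5-0-1406 "
    r"Node (?P<local>\S+) disconnected from node (?P<remote>\S+)"
)

def summarise(lines):
    """Return (longest unacknowledged interval per host, list of disconnects)."""
    longest = {}      # host -> longest V-5-0-1404 interval seen, in seconds
    disconnects = []  # (timestamp, host) for each V-5-0-1406 message
    for line in lines:
        m = HEARTBEAT_RE.search(line)
        if m:
            host = m["host"]
            longest[host] = max(longest.get(host, 0), int(m["secs"]))
            continue
        m = DISCONNECT_RE.search(line)
        if m:
            disconnects.append((m["ts"], m["host"]))
    return longest, disconnects
```

Feeding the function the lines of a messages file (for example, an open file object) gives a quick view of how long heartbeats went unacknowledged on each host before a disconnect was logged.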
Cause
VVR sends its UDP heartbeat packets on port 4145, the same port that is used for replicating data. While the replication itself runs over TCP, the heartbeats are sent over UDP, from port 4145 at the source to port 4145 at the destination. If a rogue zero-length UDP packet arrives at replication port 4145 from a random port, VVR is unable to handle the packet and a network disconnect occurs.
During in-house testing, it was observed that running a port scanner (such as nmap) in parallel in a VVR environment produced zero-length UDP packets from a random port to replication port 4145.
For example:
06:06:47.003073 IP (tos 0x0, ttl 40, id 52725, offset 0, flags [none], proto UDP (17), length 28)
prodsysA.local.53180 > prodsysA.local.vvr-control: [udp sum ok] UDP,length 0
0x0000: 4500 001c cdf5 0000 2811 f2c0 c0a8 2865 E.......(.....(e
0x0010: c0a8 2865 cfbc 1031 0008 4dd5 e800 0000 ..(e...1..M.....
0x0020: 3747 b366 d458 dd01 8000 0000 7G.f.X......
The UDP packet above has a payload length of zero ("length 0") and travels from random port 53180 (0xcfbc) to replication port 4145 (0x1031).
Since this packet has no NMCOM signature, it does not belong to VVR.
A genuine VVR heartbeat packet looks as follows:
09:46:28.793293 IP (tos 0x0, ttl 64, id 38901, offset 0, flags [DF], proto UDP (17), length 350)
DRsysA.vvr-control > prodsysA.local.vvr-control:[udp sum ok] UDP, length 322
0x0000: 4500 015e 97f5 4000 4011 f71a c0a8 14c9 E..^..@.@.......
0x0010: c0a8 1465 1031 1031 014a 07e7 6fa2 0a10 ...e.1.1.J..o...
0x0020: d6c9 88f7 4e4d 434f 4d00 0102 0000 0000 ....NMCOM.......
0x0030: 0000 0258 0000 0000 0000 0000 0000 0258 ...X...........X
0x0040: 0000 0000 0169 784c 0006 0000 0000 0000 .....ixL........
.
0x00a0: 0000 0000 0000 5d4a 0000 0000 0000 0002 ......]J........
Packet explanation:
DRsysA.vvr-control > prodsysA.local.vvr-control shows the direction of the UDP packet (from DR to Prod).
Hex row 0x0010 shows the packet going from port 0x1031 (4145) to port 0x1031 (4145).
Hex row 0x0020 contains 4e4d 434f 4d, which is the NMCOM header magic (the ASCII characters "NMCOM").
This header magic identifies the packet as a VVR packet.
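The distinction between the two captures can be sketched in a few lines of Python. This is an illustrative helper only, not part of VVR or of the hotfix: the function name classify_udp and the "rogue" rule (a UDP packet addressed to port 4145 whose payload lacks the NMCOM magic) are assumptions drawn from the dumps above, and the magic is searched for anywhere in the payload because this article does not document a fixed offset. The input is a raw IPv4 packet starting at the IP header, as shown in the hex rows.

```python
import struct

VVR_PORT = 4145          # replication / heartbeat port named in this article
NMCOM_MAGIC = b"NMCOM"   # header magic visible in the VVR heartbeat dump

def classify_udp(ip_packet: bytes) -> dict:
    """Summarise a raw IPv4/UDP packet that starts at the IP header."""
    if ip_packet[0] >> 4 != 4:
        raise ValueError("not an IPv4 packet")
    if ip_packet[9] != 17:                    # IP protocol 17 = UDP
        raise ValueError("not a UDP packet")
    ihl = (ip_packet[0] & 0x0F) * 4           # IP header length in bytes
    src, dst, udp_len, _checksum = struct.unpack(
        "!HHHH", ip_packet[ihl:ihl + 8])
    payload_len = udp_len - 8                 # UDP length includes its 8-byte header
    # Slice by the UDP length so trailing link-layer padding is ignored.
    payload = ip_packet[ihl + 8:ihl + 8 + payload_len]
    has_magic = NMCOM_MAGIC in payload
    return {
        "src_port": src,
        "dst_port": dst,
        "payload_len": payload_len,
        "has_nmcom": has_magic,
        # Assumed rule: traffic to the VVR port without the magic is not VVR's.
        "rogue": dst == VVR_PORT and not has_magic,
    }
```

Passing the bytes of the first capture (for example via bytes.fromhex) yields source port 53180, destination port 4145, a payload length of 0 and no NMCOM magic, whereas the heartbeat capture yields ports 4145 to 4145, a payload length of 322 and the magic present.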
Solution
A supported hotfix has been created to address this issue. Please contact Technical Support to obtain it. This hotfix has not yet undergone extensive QA testing. Consequently, if this issue is not adversely affecting the VVR environment, or a satisfactory temporary workaround is already in place, the Arctera (Veritas) recommendation is to wait for the public release of this fix.
The Product Engineering Team currently plans to address this issue by way of a patch or hotfix to the current version of the software. Please note that we as a company reserve the right to remove any fix from the targeted release if it does not pass quality assurance tests. Our plans are subject to change and any action taken by you based on the above information or your reliance upon the above information is made at your own risk.
Please contact your Sales representative or the Sales group for upgrade information including upgrade eligibility to the release containing the resolution for this issue.