"V-5-1-11092 cleanup_client: (There are minor number conflicts on a slave node)" is reported when a CVM slave fails to join the cluster because of a minor number conflict
Problem
A Cluster Volume Manager (CVM) slave fails to join the cluster because of a minor number conflict.
Error Message
The following messages are logged when the CVM slave tries to join the cluster.
Feb 29 18:53:21 server102 kernel: GAB INFO V-15-1-20036 Port v[GAB_LEGACY_CLIENT (refcount 0)] gen cf0652 membership 01
Feb 29 18:53:21 server102 kernel: GAB INFO V-15-1-20038 Port v[GAB_LEGACY_CLIENT (refcount 0)] gen cf0652 k_jeopardy ; 2
Feb 29 18:53:21 server102 kernel: GAB INFO V-15-1-20040 Port v[GAB_LEGACY_CLIENT (refcount 0)] gen cf0652 visible ; 2
Feb 29 18:53:21 server102 kernel: GAB INFO V-15-1-20036 Port y[GAB_LEGACY_CLIENT (refcount 0)] gen cf0653 membership 01
Feb 29 18:53:21 server102 kernel: GAB INFO V-15-1-20038 Port y[GAB_LEGACY_CLIENT (refcount 0)] gen cf0653 k_jeopardy ; 2
Feb 29 18:53:21 server102 kernel: GAB INFO V-15-1-20040 Port y[GAB_LEGACY_CLIENT (refcount 0)] gen cf0653 visible ; 2
Feb 29 18:53:21 server102 vxvm:vxconfigd: V-5-1-7900 CVM_VOLD_CONFIG command received
Feb 29 18:53:21 server102 kernel: VxVM vxio V-5-3-1906 vol_gab_ms_msg: ring broadcast commit completed for join/leave reconfig cs_flags=0x580802
Feb 29 18:53:22 server102 kernel: GAB INFO V-15-1-20036 Port m[GAB_LEGACY_CLIENT (refcount 0)] gen cf0651 membership 01
Feb 29 18:53:22 server102 kernel: GAB INFO V-15-1-20038 Port m[GAB_LEGACY_CLIENT (refcount 0)] gen cf0651 k_jeopardy ; 2
Feb 29 18:53:22 server102 kernel: GAB INFO V-15-1-20040 Port m[GAB_LEGACY_CLIENT (refcount 0)] gen cf0651 visible ; 2
Feb 29 18:53:22 server102 kernel: VxVM vxio V-5-3-2015 reconfig message on port m received cf0650
Feb 29 18:53:26 server102 kernel: GAB INFO V-15-1-20036 Port w[GAB_USER_CLIENT (refcount 0)] gen cf0655 membership 01
Feb 29 18:53:26 server102 kernel: GAB INFO V-15-1-20038 Port w[GAB_USER_CLIENT (refcount 0)] gen cf0655 k_jeopardy ; 2
Feb 29 18:53:26 server102 kernel: GAB INFO V-15-1-20040 Port w[GAB_USER_CLIENT (refcount 0)] gen cf0655 visible ; 2
Feb 29 18:53:26 server102 kernel: VxVM vxio V-5-0-1910 Cleaning incomplete shared diskgroup devldg dgiid 33792.104
Feb 29 18:53:26 server102 vxvm:vxconfigd: V-5-1-11092 cleanup_client: (There are minor number conflicts on a slave node) 231
Feb 29 18:53:26 server102 vxvm:vxconfigd: V-5-1-11467 kernel_fail_join() : Reconfiguration interrupted: Reason is retry to add a node failed (13, 0)
Feb 29 18:53:26 server102 kernel: VxVM vxio V-5-0-164 Failed to join cluster clus123, aborting
Feb 29 18:53:26 server102 kernel: VxVM vxio V-5-3-1250 joinsio_done: Node aborting, join for node 0 being failed
Feb 29 18:53:26 server102 kernel: VxVM vxio V-5-3-672 abort_joinp: aborting joinp for node 0 with err 11
Feb 29 18:53:26 server102 kernel: GAB INFO V-15-1-20032 Port y closed
Feb 29 18:53:26 server102 vxvm:vxconfigd: V-5-1-7901 CVM_VOLD_STOP command received
Feb 29 18:53:26 server102 kernel: GAB INFO V-15-1-20032 Port m closed
Feb 29 18:53:26 server102 kernel: GAB INFO V-15-1-20032 Port v closed
Feb 29 18:53:26 server102 kernel: GAB INFO V-15-1-20032 Port w closed
Feb 29 18:53:26 server102 AgentFramework[11752]: VCS ERROR V-16-20006-1005 CVMCluster:cvm_clus:monitor:node - state: out of cluster
reason: There are minor number conflicts on a slave node: retry to add a node failed
Feb 29 18:55:26 server102 AgentFramework[11752]: VCS ERROR V-16-2-13066 Thread(4132432704) Agent is calling clean for resource(cvm_clus) because the resource is not up even after online completed.
Feb 29 18:55:26 server102 Had[11726]: VCS ERROR V-16-2-13066 (server102) Agent is calling clean for resource(cvm_clus) because the resource is not up even after online completed.
Feb 29 18:55:26 server102 AgentFramework[11752]: VCS ERROR V-16-2-13068 Thread(4132432704) Resource(cvm_clus) - clean completed successfully.
Feb 29 18:55:26 server102 AgentFramework[11752]: VCS ERROR V-16-2-13072 Thread(4132432704) Resource(cvm_clus): Agent is retrying online (attempt number 1 of 2).
Cause
This problem is usually caused by a minor number conflict between CVM shared diskgroup objects (such as volumes, volume sets, or Replicated Volume Groups (RVGs)) and private diskgroup objects. Confirm that, on the joining CVM slave, the minor numbers of the private diskgroup objects do not overlap with those of the CVM shared diskgroup objects. Such conflicts should normally be resolved automatically at import time by the autoreminor feature, which is enabled by default.
Figure 1 - Extract from the vxtune(1M) manual page
The following tunable parameters apply for Cluster Volume Manager (CVM):
autoreminor
Turns on or off the automatic reminor functionality. A disk group cannot be imported if the device minor numbers of the disk group or its objects conflict with those of an existing disk group. When autoreminor is on, VxVM automatically assigns new minor numbers to a disk group if VxVM detects a conflict during an import operation. The disk group is then imported. The default value is on.
Note: VxVM does not reminor a disk group that is already imported, regardless of whether autoreminor is set to on. For example, if you attempt to add a node to a cluster and the joining node has minor numbers that conflict with a disk group in the cluster. In this case, the join operation fails. You must reminor the disk group manually.
In some scenarios such as with NFS file systems, assigning new minor numbers may result in issues. In this case, set the tunable parameter to off. When the autoreminor parameter is set to off, attempting to import a disk group with conflicting minor numbers will fail, even when you specify the force (-f) option. You must manually reminor the disk group before you can import the disk group.
Volume Manager (VxVM) divides the minor numbers into two sets. One set is for the private diskgroups. The other set is for the CVM shared diskgroups. The two sets of minor numbers are divided by the following vxtune parameter:
- sharedminorstart: The starting number in the range used to assign device minor numbers in shared (CVM) disk groups. The default value is 33000.
Sometimes, when a diskgroup is first initialized as a private diskgroup (running vxdg init without the -s option), VxVM assigns a base_minor of less than 33000. When the diskgroup is later imported as shared, VxVM does not change the minor numbers, so they remain below the 33000 boundary and may collide with those of existing private diskgroups on other nodes in the cluster.
The same issue exists the other way round. If a diskgroup is first initialized as shared (vxdg -s init) but later imported as private, the private diskgroup will have minor numbers of 33000 or higher. In such situations, you may want to run vxdg reminor to move the minor numbers back to the correct set and avoid conflicts.
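The split between the two sets can be checked with a few lines of shell. The following is a minimal sketch, not a Veritas-supplied tool: the function names minor_set and check_placement and the variable SHARED_MINOR_START are hypothetical, and the value 33000 is only the documented default of sharedminorstart, so adjust it if the tunable has been changed on your systems.

```shell
#!/bin/sh
# Hypothetical helpers (not part of VxVM).
# SHARED_MINOR_START mirrors the documented default of the sharedminorstart
# tunable; override it if the tunable has been changed.
SHARED_MINOR_START=${SHARED_MINOR_START:-33000}

# minor_set BASE_MINOR
# Print which set a base_minor falls into: "shared" when it is at or above
# SHARED_MINOR_START, otherwise "private".
minor_set() {
    if [ "$1" -ge "$SHARED_MINOR_START" ]; then
        echo shared
    else
        echo private
    fi
}

# check_placement TYPE BASE_MINOR
# TYPE is how the disk group is currently imported (private or shared).
# Print a warning and return 1 when the base_minor lies in the other set.
check_placement() {
    actual=$(minor_set "$2")
    if [ "$actual" != "$1" ]; then
        echo "WARNING: $1 disk group has base_minor $2 in the $actual range"
        return 1
    fi
    echo "OK: $1 disk group has base_minor $2 in the $actual range"
}
```

For example, check_placement shared 28000 prints a warning, which corresponds to the first case described above: a diskgroup initialized as private and later imported as shared.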
Apart from the cause described above, there is a more obscure reason for a CVM slave join to fail with a minor number conflict. When two diskgroups are reminored to the same base_minor while some of the existing in-kernel minor numbers remain unchanged, the CVM slave join fails with a minor number conflict.
The following is an example of how this can happen.
First, two diskgroups are created with different base_minor values. Then one diskgroup is reminored to 36000.
server101# vxdg -g proddg reminor 36000
server101# vxprint -m -g proddg | grep 'minor=[0-9]'
base_minor=36000
minor=36000
The CFS filesystems are unmounted and the diskgroup is deported.
server101# hagrp -offline SGprod -sys server101 # unmounts the filesystems
server101# vxdg deport proddg
The other diskgroup is also reminored to the same base_minor 36000 while having the CFS filesystems mounted.
server101# vxdg -g devldg reminor 36000
VxVM vxdg WARNING V-5-1-3858 Volume devlvol01: Device is open, will renumber on import
Since the filesystems are still mounted, VxVM does not change the existing in-kernel minor number.
server101# ls -lR /dev/vx/rdsk
....
/dev/vx/rdsk/devldg:
total 0
crw-------. 1 root root 199, 28000 Feb 29 18:44 devlvol01 <<< in-kernel, the volume minor number is still the old one 28000
But the on-disk configuration has already been changed to 36000.
server101# vxprint -m -g devldg | grep 'minor=[0-9]' base_minor=36000 minor=36000
Now, the previously deported diskgroup is imported.
server101# hagrp -online SGprod -sys server101
Feb 29 18:51:59 server101 vxvm:vxconfigd: V-5-1-11401 : dg import with I/O fence enabled
Feb 29 18:51:59 server101 vxvm:vxconfigd: V-5-1-11401 proddg: dg import with I/O fence enabled
Feb 29 18:51:59 server101 kernel: sd 3:0:0:13: reservation conflict
Feb 29 18:51:59 server101 kernel: sd 5:0:0:13: reservation conflict
Feb 29 18:51:59 server101 vxvm:vxconfigd: V-5-1-16765 Selecting configuration database copy from disk_6 from disks: disk_6
Feb 29 18:52:03 server101 vxvm:vxconfigd: V-5-1-16766 Trying to import the disk group proddg using configuration database copy from disk_6
Feb 29 18:52:04 server101 vxvm:vxconfigd: V-5-1-16254 Disk group import of proddg succeeded.
From the ls -lR /dev/vx/rdsk/ output, there is no minor number conflict.
server101# ls -lR /dev/vx/rdsk
....
/dev/vx/rdsk/devldg:
total 0
crw-------. 1 root root 199, 28000 Feb 29 18:44 devlvol01 <<< 28000
/dev/vx/rdsk/proddg:
total 0
crw-------. 1 root root 199, 36000 Feb 29 18:52 prodvol01 <<< 36000
But now the slave cannot join, because both diskgroups have the same on-disk base_minor of 36000.
server101# hagrp -online cvm -sys server102
Feb 29 18:53:26 server102 kernel: VxVM vxio V-5-0-1910 Cleaning incomplete shared diskgroup devldg dgiid 33792.104
Feb 29 18:53:26 server102 vxvm:vxconfigd: V-5-1-11092 cleanup_client: (There are minor number conflicts on a slave node) 231
Solution
Confirm that neither the in-kernel minor numbers nor the on-disk minor numbers are conflicting.
For in-kernel minor numbers, you can use the following command:
server101# ls -lR /dev/vx/rdsk
....
/dev/vx/rdsk/devldg:
total 0
crw-------. 1 root root 199, 28000 Feb 29 18:44 devlvol01
/dev/vx/rdsk/proddg:
total 0
crw-------. 1 root root 199, 36000 Feb 29 18:52 prodvol01
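When there are many volumes, the listing can be condensed into one line per device. The following is a minimal sketch that assumes the ls -lR layout shown above (a "directory:" header line followed by character or block device entries whose sixth field is the minor number). The function name kernel_minors is hypothetical.

```shell
#!/bin/sh
# Hypothetical helper (not part of VxVM).
# kernel_minors: read `ls -lR` output on stdin and print
# "diskgroup volume minor" for every device node found.
# Assumes the long-listing layout shown in this article, where the minor
# number is the sixth whitespace-separated field of a device entry.
kernel_minors() {
    awk '
        # Directory header such as "/dev/vx/rdsk/devldg:";
        # remember its last path component as the disk group name.
        /^\/.*:$/ {
            n = split(substr($0, 1, length($0) - 1), parts, "/")
            dg = parts[n]
            next
        }
        # Character (c) or block (b) device entry.
        /^[cb]/ { print dg, $NF, $6 }
    '
}

# Usage on a live system:
#   ls -lR /dev/vx/rdsk | kernel_minors
```

With the listing above as input, the function prints "devldg devlvol01 28000" and "proddg prodvol01 36000", which makes it easier to compare the in-kernel values with the on-disk values reported by vxprint.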
For on-disk minor numbers, use vxprint:
server101# vxprint -m -g devldg | grep 'minor=[0-9]'
base_minor=36000
minor=36000
server101# vxprint -m -g proddg | grep 'minor=[0-9]'
base_minor=36000
minor=36000
Run the above command for all of the diskgroups that are currently imported.
If there is any base_minor conflict, please run vxdg reminor to fix the issue.
# vxdg -g proddg reminor 37000
Choose a new minor number that is not used by any other diskgroup.
To have the minor numbers change immediately, unmount the filesystems before running vxdg reminor.