I have three 3-node clusters in my environment. Each cluster has two synchronous nodes at one data center and one asynchronous node at another data center. Each node has its own disks, so there's no shared storage. Occasionally, maybe once a month, my test cluster becomes unavailable. It won't ping and clients can't connect to the SQL cluster listener. It happened again this morning and was unavailable for about half an hour; then, without my changing anything, it came back to life.
In the past, I was able to initiate a manual failover/failback and it would come back right away, but I'd like to find out why this is happening in the first place. This does not happen with the other clusters.
There are no cluster errors registered in Event Viewer. No nodes fail; I can connect to each one individually. The cluster name does not ping and neither does the SQL listener, which leads me to believe it is a Microsoft clustering problem and probably not a SQL issue.
I ran Get-ClusterLog and have been poring over the output. I do see DBG messages like:
[Verbose] 000009ac.00001678::2019/02/12-09:46:49.619 DBG [GEM] Node 1: GEM Id 3565 has been ack'ed by every node. Unacknowledged Message Count = 5
[Verbose] 000009ac.00001678::2019/02/12-09:46:49.619 DBG [GEM] Node 1: GEM Id 3566 has been ack'ed by every node. Unacknowledged Message Count = 4
[Verbose] 000009ac.00001678::2019/02/12-09:46:49.621 DBG [GEM] Node 1: GEM Id 3567 has been ack'ed by every node. Unacknowledged Message Count = 3
[Verbose] 000009ac.00001678::2019/02/12-09:46:49.621 DBG [GEM] Node 1: GEM Id 3568 has been ack'ed by every node. Unacknowledged Message Count = 2
[Verbose] 000009ac.00001678::2019/02/12-09:46:49.621 DBG [GEM] Node 1: GEM Id 3569 has been ack'ed by every node. Unacknowledged Message Count = 1
[Verbose] 000009ac.00001678::2019/02/12-09:46:49.621 DBG [GEM] Node 1: GEM Id 3570 has been ack'ed by every node. Unacknowledged Message Count = 0
These appeared around the time it came back up.
I also often see entries like the following. The "Possible owners list size is 0" message is interesting.
[Verbose] 00001a40.00002920::2019/02/12-09:48:17.162 INFO [RES] Distributed Network Name <CAUSACSQvu8>: Netname received Refresh clones message
[Verbose] 00001a40.00002920::2019/02/12-09:48:17.162 INFO [RES] Distributed Network Name <CAUSACSQvu8>: Possible owners list size is 0
[Verbose] 00001a40.000007d0::2019/02/12-09:48:17.165 INFO [RES] Network Name: Agent: InitializeModule, Trying to initialize Module(ad4aa780-67db-41df-977e-35ef8ac4be5f,Client) when there is one already in Initialized/Idle state
[Verbose] 00001a40.00002920::2019/02/12-09:48:17.165 INFO [RES] Distributed Network Name <CAUSACSQvu8>: StartupClone - Client module already exists.
[Verbose] 00001a40.000007d0::2019/02/12-09:48:17.165 INFO [RES] Distributed Network Name <CAUSACSQvu8>: Client: Synching with slow operation
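In case it helps anyone looking at the same kind of log, here is a minimal sketch of how lines in this format could be parsed and narrowed to the outage window. The regular expression is inferred only from the excerpts above, and the names parse_line and in_window are made up for illustration; this is not an official parser for the cluster log format.

```python
import re
from datetime import datetime

# Pattern inferred from the excerpts above (assumption, not a format spec):
# [Verbose] <pid>.<tid>::<YYYY/MM/DD-HH:MM:SS.mmm> <LEVEL> [<component>] <message>
LINE_RE = re.compile(
    r"^(?:\[\w+\]\s+)?"                                   # optional "[Verbose]" prefix
    r"(?P<pid>[0-9a-fA-F]+)\.(?P<tid>[0-9a-fA-F]+)::"     # process and thread ids
    r"(?P<ts>\d{4}/\d{2}/\d{2}-\d{2}:\d{2}:\d{2}\.\d{3})\s+"
    r"(?P<level>[A-Z]+)\s+"                               # e.g. DBG, INFO
    r"(?:\[(?P<component>[^\]]+)\]\s*)?"                  # e.g. [GEM], [RES]
    r"(?P<message>.*)$"
)


def parse_line(line):
    """Return a dict for one log line, or None if the line does not match."""
    m = LINE_RE.match(line.strip())
    if m is None:
        return None  # e.g. a wrapped continuation line
    entry = m.groupdict()
    entry["ts"] = datetime.strptime(entry["ts"], "%Y/%m/%d-%H:%M:%S.%f")
    return entry


def in_window(lines, start, end):
    """Yield parsed entries whose timestamp lies in [start, end]."""
    for line in lines:
        entry = parse_line(line)
        if entry is not None and start <= entry["ts"] <= end:
            yield entry


if __name__ == "__main__":
    sample = [
        "[Verbose] 000009ac.00001678::2019/02/12-09:46:49.619 DBG [GEM] Node 1: "
        "GEM Id 3565 has been ack'ed by every node. Unacknowledged Message Count = 5",
        "[Verbose] 00001a40.00002920::2019/02/12-09:48:17.162 INFO [RES] "
        "Distributed Network Name <CAUSACSQvu8>: Netname received Refresh clones message",
    ]
    start = datetime(2019, 2, 12, 9, 46, 0)
    end = datetime(2019, 2, 12, 9, 47, 0)
    for e in in_window(sample, start, end):
        print(e["ts"], e["level"], e["component"], e["message"])
```

Filtering on a window a few minutes either side of the outage, and then grouping by component (GEM, RES, and so on), might make it easier to see what changed just before the cluster name stopped responding.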
I don't see a cause-and-effect relationship when this fails, so I can't trigger it. It's like chasing a ghost. If anyone can suggest a place to start looking, I'd appreciate it.