Hello Experts,
Good day! We encountered a problem on our production system, which runs in a two-node WSFC/SQL Server AlwaysOn environment: the availability group was stuck in the 'RESOLVING' state on 2016/04/13 from 21:17:19 to 21:18:15 (about 1 minute).
Both nodes run Windows Server 2012 Standard, and the DBMS is SQL Server 2012 Service Pack 3.
Microsoft SQL Server 2012 (SP3) (KB3072779) - 11.0.6020.0 (X64)
Oct 20 2015 15:36:27 Copyright (c) Microsoft Corporation Enterprise Edition: Core-based Licensing (64-bit) on Windows NT 6.2 <X64> (Build 9200: ))
During our research, we read several Microsoft KB articles that describe similar issues. They are listed below.
KB2699013-FIX_SQL Server 2012, SQL Server 2008 R2 or SQL Server 2008 stops responding and a 'Non-yielding Scheduler' error is logged
KB3081074-FIX_A stalled dispatcher system dump forces a failover and service outage in SQL Server 2014 or SQL Server 2012
KB3020116-FIX_"Non-yielding scheduler" error occurs and AlwaysOn Availability Group transits to RESOLVING state
KB3112363-Improvements for SQL Server AlwaysOn Lease Timeout supportability in SQL Server 2012
I assumed that this service pack/cumulative update level already included the fixes for these problems. Is that not the case? In any event, the issue has now found us, so I would like to clarify: is this still an unresolved potential bug? Any recommendation or advice is highly appreciated. Thanks a lot!
PS. I have also created a thread about this problem on the MSDN forum, FYI.
Kevin
=====================================================================
SQL Error Log
=====================================================================
04/13/2016 21:18:15,spid30s,Unknown,AlwaysOn Availability Groups connection with secondary database established for primary database 'QAP' on the availability replica with Replica ID: {ad0236ba-a1cf-449d-b1c1-ce7d3c86e9cc}. This is an informational message only. No user action is required.
04/13/2016 21:18:01,spid164,Unknown,The log shipping secondary database DL980-4.QAP has restore threshold of 45 minutes and is out of sync. No restore was performed for 520600 minutes. Restored latency is 10 minutes. Check agent log and logshipping monitor information.
04/13/2016 21:18:01,spid164,Unknown,Error: 14421<c/> Severity: 16<c/> State: 1.
04/13/2016 21:17:40,Logon,Unknown,Unable to access database 'QAP' because its replica role is RESOLVING which does not allow connections. Try the operation again later.
04/13/2016 21:17:40,Logon,Unknown,Error: 983<c/> Severity: 14<c/> State: 1.
04/13/2016 21:17:40,spid49s,Unknown,Nonqualified transactions are being rolled back in database QAP for an AlwaysOn Availability Groups state change. Estimated rollback completion: 100%. This is an informational message only. No user action is required.
04/13/2016 21:17:40,Logon,Unknown,Unable to access database 'QAP' because its replica role is RESOLVING which does not allow connections. Try the operation again later.
04/13/2016 21:17:40,Logon,Unknown,Error: 983<c/> Severity: 14<c/> State: 1.
………………………………………
04/13/2016 21:17:23,Logon,Unknown,Unable to access database 'QAP' because its replica role is RESOLVING which does not allow connections. Try the operation again later.
04/13/2016 21:17:23,Logon,Unknown,Error: 983<c/> Severity: 14<c/> State: 1.
04/13/2016 21:17:23,Logon,Unknown,Unable to access database 'QAP' because its replica role is RESOLVING which does not allow connections. Try the operation again later.
04/13/2016 21:17:23,Logon,Unknown,Error: 983<c/> Severity: 14<c/> State: 1.
04/13/2016 21:17:23,Logon,Unknown,Unable to access database 'QAP' because its replica role is RESOLVING which does not allow connections. Try the operation again later.
04/13/2016 21:17:23,Logon,Unknown,Error: 983<c/> Severity: 14<c/> State: 1.
04/13/2016 21:17:23,Logon,Unknown,Unable to access database 'QAP' because its replica role is RESOLVING which does not allow connections. Try the operation again later.
04/13/2016 21:17:23,Logon,Unknown,Error: 983<c/> Severity: 14<c/> State: 1.
………………………………………
04/13/2016 21:17:09,spid49s,Unknown,The availability group database "QAP" is changing roles from "PRIMARY" to "RESOLVING" because the mirroring session or availability group failed over due to role synchronization. This is an informational message only. No user action is required.
04/13/2016 21:17:09,Server,Unknown,Stopped listening on virtual network name 'dbgrpqap'. No user action is required.
………………………………………
04/13/2016 21:17:09,spid49s,Unknown,AlwaysOn Availability Groups connection with secondary database terminated for primary database 'QAP' on the availability replica with Replica ID: {ad0236ba-a1cf-449d-b1c1-ce7d3c86e9cc}. This is an informational message only. No user action is required.
04/13/2016 21:17:09,Server,Unknown,The state of the local availability replica in availability group 'AGQAP' has changed from 'PRIMARY_NORMAL' to 'RESOLVING_NORMAL'. The replica state changed because of either a startup<c/> a failover<c/> a communication issue<c/> or a cluster error. For more information<c/> see the availability group dashboard<c/> SQL Server error log<c/> Windows Server Failover Cluster management console or Windows Server Failover Cluster log.
04/13/2016 21:17:09,Server,Unknown,AlwaysOn: The local replica of availability group 'AGQAP' is going offline because either the lease expired or lease renewal failed. This is an informational message only. No user action is required.
04/13/2016 21:17:08,Server,Unknown,Process 0:0:0 (0xfe4) Worker 0x0000000006B16160 appears to be non-yielding on Scheduler 28. Thread creation time: 13102488101502. Approx Thread CPU Used: kernel 1591 ms<c/> user 0 ms. Process Utilization 7%. System Idle 94%. Interval: 74565 ms.
04/13/2016 21:17:08,Server,Unknown,The lease between availability group 'AGQAP' and the Windows Server Failover Cluster has expired. A connectivity issue occurred between the instance of SQL Server and the Windows Server Failover Cluster. To determine whether the availability group is failing over correctly<c/> check the corresponding availability group resource in the Windows Server Failover Cluster.
04/13/2016 21:17:08,Server,Unknown,Error: 19407<c/> Severity: 16<c/> State: 1.
………………………………………
04/13/2016 21:17:08,Server,Unknown,DoMiniDump () encountered error (0x80004005) - Unspecified error
04/13/2016 21:17:07,Server,Unknown,Timeout waiting for external dump process 8480.
04/13/2016 21:17:07,Server,Unknown,Windows Server Failover Cluster did not receive a process event signal from SQL Server hosting availability group 'AGQAP' within the lease timeout period.
04/13/2016 21:17:07,Server,Unknown,Error: 19419<c/> Severity: 16<c/> State: 1.
04/13/2016 21:16:18,Server,Unknown,Stack Signature for the dump is 0x000000000000036A
04/13/2016 21:16:18,Server,Unknown,* *******************************************************************************
04/13/2016 21:16:18,Server,Unknown,*
04/13/2016 21:16:18,Server,Unknown,* Non-yielding Scheduler
04/13/2016 21:16:18,Server,Unknown,*
04/13/2016 21:16:18,Server,Unknown,* 04/13/16 21:16:18 spid 3976
04/13/2016 21:16:18,Server,Unknown,* BEGIN STACK DUMP:
04/13/2016 21:16:18,Server,Unknown,*
04/13/2016 21:16:18,Server,Unknown,* *******************************************************************************
04/13/2016 21:16:18,Server,Unknown,***Unable to get thread context for spid 0
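To make the timeline in the log above easier to read, here is a minimal Python sketch that parses three of the exported lines and computes the gaps between them. The messages are shortened copies of the entries above, the helper names `parse_line` and `first_time` are made up for illustration, and the third column (always "Unknown" in this export) is simply ignored.

```python
from datetime import datetime

# Three entries copied from the error log above (messages shortened).
LOG_LINES = [
    "04/13/2016 21:18:15,spid30s,Unknown,AlwaysOn Availability Groups "
    "connection with secondary database established for primary database 'QAP'",
    "04/13/2016 21:17:09,Server,Unknown,The state of the local availability "
    "replica in availability group 'AGQAP' has changed from 'PRIMARY_NORMAL' "
    "to 'RESOLVING_NORMAL'.",
    "04/13/2016 21:16:18,Server,Unknown,* Non-yielding Scheduler",
]


def parse_line(line):
    """Split one exported log line into (timestamp, source, message)."""
    stamp, source, _third_field, message = line.split(",", 3)
    return datetime.strptime(stamp, "%m/%d/%Y %H:%M:%S"), source, message


def first_time(lines, needle):
    """Earliest timestamp among the lines whose message contains needle."""
    parsed = [parse_line(line) for line in lines]
    return min(ts for ts, _source, message in parsed if needle in message)


dump_at = first_time(LOG_LINES, "Non-yielding Scheduler")
resolving_at = first_time(LOG_LINES, "RESOLVING_NORMAL")
reconnected_at = first_time(LOG_LINES, "secondary database established")

dump_to_resolving_s = (resolving_at - dump_at).total_seconds()
resolving_to_reconnect_s = (reconnected_at - resolving_at).total_seconds()

print("dump -> RESOLVING (s):", dump_to_resolving_s)
print("RESOLVING -> secondary reconnected (s):", resolving_to_reconnect_s)
```

This only measures the distance between log entries. It does not say anything about why the scheduler stopped yielding.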
=====================================================================
BugCheck Dump Text
=====================================================================
Current time is 21:16:17 04/13/16.
This file is generated by Microsoft SQL Server
version 11.0.6020.0
upon detection of fatal unexpected error. Please return this file,
the query or program that produced the bugcheck, the database and
the error log, and any other pertinent information with a Service Request.
Computer type is Intel(R) Xeon(R) CPU X7560 @ 2.27GHz.
Bios Version is HP - 2
128 X64 level 8664, 2 Mhz processor (s).
Windows NT 6.2 Build 9200 CSD .
Memory
MemoryLoad = 99%
Total Physical = 1048565 MB
Available Physical = 3472 MB
Total Page File = 1348565 MB
Available Page File = 364906 MB
Total Virtual = 8388607 MB
Available Virtual = 7188925 MB
***Unable to get thread context for spid 0
* *******************************************************************************
*
* BEGIN STACK DUMP:
* 04/13/16 21:16:18 spid 3976
*
* Non-yielding Scheduler
*
* *******************************************************************************
=====================================================================
SQL Server Memory Configuration & Availability Group Properties
=====================================================================
Memory_usedby_Sqlserver_MB  Locked_pages_used_Sqlserver_MB  Total_VAS_in_MB  process_physical_memory_low  process_virtual_memory_low
--------------------------  ------------------------------  ---------------  ---------------------------  --------------------------
960375                      950171                          8388607          0                            0
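To put the memory figures in proportion, the following small Python sketch does the arithmetic on the numbers reported above (total and available physical memory from the dump text, SQL Server usage from the query output). The variable names and the `pct` helper are my own. The "other" figure is simply what remains after subtracting SQL Server's usage and the available memory, which assumes the two sources are directly comparable even though they were captured at different times.

```python
# Values copied from the dump text and the query output above (all in MB).
total_physical_mb = 1048565      # "Total Physical"
available_physical_mb = 3472     # "Available Physical"
sql_used_mb = 960375             # Memory_usedby_Sqlserver_MB
locked_pages_mb = 950171         # Locked_pages_used_Sqlserver_MB


def pct(part, whole):
    """Return part as a percentage of whole, rounded to two decimals."""
    return round(100.0 * part / whole, 2)


sql_share_pct = pct(sql_used_mb, total_physical_mb)
locked_share_pct = pct(locked_pages_mb, sql_used_mb)
available_share_pct = pct(available_physical_mb, total_physical_mb)

# Remainder not attributed to SQL Server and not reported as available.
other_mb = total_physical_mb - sql_used_mb - available_physical_mb

print("SQL Server share of physical memory (%):", sql_share_pct)
print("Locked pages as share of SQL Server memory (%):", locked_share_pct)
print("Available physical memory (%):", available_share_pct)
print("Remaining memory, neither SQL Server nor available (MB):", other_mb)
```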
=====================================================================
Object  Name                   Value   Type
------  ----                   -----   ----
AGQAP   VerboseLogging         0       UInt32
AGQAP   LeaseTimeout           100000  UInt32
AGQAP   FailureConditionLevel  1       UInt32
AGQAP   HealthCheckTimeout     150000  UInt32
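For reference, the timeout properties above are in milliseconds. The Python sketch below converts them to seconds and sets them beside the 74565 ms interval reported in the non-yielding scheduler message in the error log. It is plain arithmetic on the posted values. The variable names are my own, and it does not model how the lease is actually renewed between SQL Server and the cluster.

```python
# Values copied from the cluster resource properties and the error log (ms).
lease_timeout_ms = 100000          # AGQAP LeaseTimeout
health_check_timeout_ms = 150000   # AGQAP HealthCheckTimeout
non_yielding_interval_ms = 74565   # "Interval: 74565 ms" in the error log


def to_seconds(milliseconds):
    """Convert milliseconds to seconds."""
    return milliseconds / 1000.0


values_ms = {
    "LeaseTimeout": lease_timeout_ms,
    "HealthCheckTimeout": health_check_timeout_ms,
    "Non-yielding interval": non_yielding_interval_ms,
}

for name, ms in values_ms.items():
    print(f"{name}: {ms} ms = {to_seconds(ms)} s")

# Difference between the configured lease timeout and the reported interval.
lease_minus_interval_ms = lease_timeout_ms - non_yielding_interval_ms
print("LeaseTimeout minus non-yielding interval (ms):", lease_minus_interval_ms)
```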