For my writeups of monitoring metrics I will add what I see as necessary correlation metrics where relevant to have a more complete description of how I would use it
Always On Availability Groups is an enterprise level HA feature from Microsoft and was introduced way back in SQL server 2012 as an alternative to the now deprecated database mirroring feature. Here I will try and introduce some metrics I believe are needed as a starting point to monitor this feature.
It is important to keep in mind that synchronous replication means that the commit is sent to the application after the transaction has been hardened to all synchronous secondary transaction logs, while in asynchronous replication the commit is reported just as if there was no replication at all. No matter the synchronization mode, all logged operations are replicated and goes into the log send queue on the primary and ends up in the redo queue on the secondary. This further implies that data is not available on the secondary for reads before the redo thread has applied the change. For more information on the latency between primary and secondary replicas Microsoft discusses it here and here
The type of replication, synchronous or asynchronous, is important input when interpreting these metrics. However, it is not included here as it is not a metric in and of itself, but rather a configuration setting.
In my opinion there are five metrics that are of initial immediate interest for monitoring a generic AOAG setup. Other metrics can be important for your specific use case, but for initial monitoring these five should suffice as a starting point. If you are experiencing performance issues other metrics and wait statistics are important and should be investigated.
Synchronization state
Naturally, the first thing we need to know is the synchronization state of our AOAG. This is found in the synchronization_state (or its more human readable friendly synchronization_state_desc) metric in sys.dm_hadr_database_replica_states. We only want to see state 1 (SYNCHRONIZED) and 2 (SYNCHRONIZING) during normal operation. For synchronous replication we only want to see state 1. Anything other than state 1 means we have a potential for data loss as is expected with asynchronous replication, where we expect to see state 2. If you find yourself in a position where you need to fail over, it can be helpful as well to query sys.dm_hadr_database_replica_cluster_states. In it you will find a column named is_failover_ready. Join it to sys.dm_hadr_database_replica_states for an overview of failover ready replicas. The appropriate script for this is taken from here and is included for convenience:
SELECT arc.replica_server_name,
drc.is_failover_ready,
drc.database_name
FROM sys.dm_hadr_database_replica_cluster_states drc
INNER JOIN sys.dm_hadr_availability_replica_cluster_states arc
on drc.replica_id = arc.replica_id
Log Send Queue
The log_send_queue_size metric in sys.dm_hadr_database_replica_states exposes the amount of transaction log records on the primary node not yet sent to the secondary node. This is measured in KB. If this is increasing unexpectedly, there might be replication issues, or the log got hit with an unusual amount of activity. All logged transactions are replicated. Keep this in mind when doing your maintenance as well (i.e. index rebuilds for instance).
Log Redo Queue
The redo_queue_size exposes the amount of data currently hardened to that secondary’s transaction log but has not yet been redone. The data will not be available for reads before the data has been restored to the secondary database by the redo threads. This is worth keeping in mind if you are offloading reads to secondary replicas that are correlated with recent writes. If your reads are dependent on very recent writes, do not offload the reads to a secondary replica as the read data might not correlate with the written data. As a side note, starting in MSSQL 2016 parallel redo threads where introduced and set as the default in an effort make the redo process more efficient and increase log redo rate, but as everything else it has a cost. Serial redo can still be configured if desired.
Log send rate / log redo rate
The log_send_rate and the redo_rate metrics measure the send and redo activity respectively in KB/second. This is also useful as it can help give an indication of issues, for instance contention or blocking of the redo thread and possible connectivity issues between the AG replicas.
When you need a quick overview
For a quick overview of your AOAGs health, I refer you a slightly modified script from Tracy Boggiano. Microsoft has a slightly modified version themselves on this site with due credit. Modify it to suite your environments and your needs:
SELECT ag.name as 'AG Name',
ar.replica_server_name,
dbs.name as 'Database',
ag.is_distributed,
ars.role_desc,
ar.availability_mode_desc,
drs.synchronization_state_desc,
drs.synchronization_health_desc,
drs.log_send_queue_size,
drs.log_send_rate,
drs.redo_queue_size,
drs.redo_rate,
drs.suspend_reason_desc,
drs.last_sent_time,
drs.last_received_time,
drs.last_hardened_time,
drs.last_redone_time,
drs.last_commit_time,
drs.secondary_lag_seconds
FROM sys.databases dbs
INNER JOIN sys.dm_hadr_database_replica_states drs
on dbs.database_id = drs.database_id
INNER JOIN sys.availability_groups ag
ON drs.group_id = ag.group_id
INNER JOIN sys.dm_hadr_availability_replica_states ars
ON ars.replica_id = drs.replica_id
INNER JOIN sys.availability_replicas ar
ON ar.replica_id = ars.replica_id
Wrap up
I want to end this post with a caution. AOAG is seldom a set and forget feature. It requires somewhat regular DBA attention and if you are not in a position to give it, AOAG might not be the right choice for you. There are other options available and you should pick the option that suits your business needs and capabilities, including DBA availability.