Check cluster health via nagios plugin

Cluster health can be monitored with a nagios plugin as described below. Note: this procedure has not been tested in production.

Refer to Configuring nrpe based internal service checks for how nrpe based internal checks work for remote systems.

On the cluster host, create the plugin '/usr/lib64/nagios/plugins/cluster_check.sh', which nrpe will call, with the content below, and make it executable:

#!/bin/bash

# Run crm status command and capture output
crm_output=$(crm status 2>&1)

# Check for "error" or "warning" in the output, ignoring case
if grep -qiE 'error|warning' <<< "$crm_output"; then
  # Report the problem with hostname and IP; nagios handles the notification
  hostname=$(hostname)
  ip=$(hostname -I | awk '{print $1}')
  echo "Cluster status is not healthy on $hostname ($ip)!"
  exit 2 # Nagios exit code for critical
fi

# Check if all nodes are online
# Assumption: crm status lists unavailable nodes as OFFLINE or UNCLEAN
if [[ "$crm_output" =~ (OFFLINE|UNCLEAN) ]]; then
  # Report the problem with hostname and IP; nagios handles the notification
  hostname=$(hostname)
  ip=$(hostname -I | awk '{print $1}')
  echo "Not all nodes are online on $hostname ($ip)!"
  exit 2 # Nagios exit code for critical
fi

# Check if all resources are started
# Assumption: crm status marks resources that are not running as Stopped or FAILED
if [[ "$crm_output" =~ (Stopped|FAILED) ]]; then
  # Report the problem with hostname and IP; nagios handles the notification
  hostname=$(hostname)
  ip=$(hostname -I | awk '{print $1}')
  echo "Not all resources are started on $hostname ($ip)!"
  exit 2 # Nagios exit code for critical
fi

echo "Cluster status is healthy!"
exit 0 # Nagios exit code for OK
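
A minimal sketch for trying out this kind of output check against saved `crm status` text, without a live cluster. The function name `classify_crm_output`, the keyword patterns and the sample text are illustrative assumptions and are not taken from a real cluster; compare them with actual output before relying on them.

```shell
#!/bin/bash
# Hypothetical helper: classify `crm status` text passed as the first argument.
# The keywords below are assumptions about the output format.
classify_crm_output() {
  local out="$1"
  if grep -qiE 'error|warning' <<< "$out"; then
    echo "CRITICAL: error or warning reported"; return 2
  fi
  if grep -qE 'OFFLINE|UNCLEAN' <<< "$out"; then
    echo "CRITICAL: not all nodes are online"; return 2
  fi
  if grep -qE 'Stopped|FAILED' <<< "$out"; then
    echo "CRITICAL: not all resources are started"; return 2
  fi
  echo "OK: cluster status is healthy"; return 0
}

# Made-up sample text for illustration only.
healthy_sample='Online: [ node1 node2 ]
Full list of resources:
 rsc_ip (ocf::heartbeat:IPaddr2): Started node1'

degraded_sample='Online: [ node1 ]
OFFLINE: [ node2 ]
Full list of resources:
 rsc_ip (ocf::heartbeat:IPaddr2): Started node1'

classify_crm_output "$healthy_sample";  echo "exit code: $?"
classify_crm_output "$degraded_sample"; echo "exit code: $?"
```

Keeping the checks in a function that takes the text as an argument makes it possible to replay output captured during an incident and confirm that the plugin would have raised an alert.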

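Edit the nrpe configuration file '/etc/nagios/nrpe.conf' (on many installations the file is named 'nrpe.cfg', so check the path on your system) and add the line below: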

command[check_cluster_status]=/usr/lib64/nagios/plugins/cluster_check.sh

Restart nrpe on the cluster machine.

Then configure a remote service check that uses the above plugin for the appropriate host, with the nagios configuration below:

define host {
    use           linux-server
    host_name     example-host
    alias         Example Host
    address       192.0.2.100
}

define service {
    use                 generic-service
    host_name           example-host
    service_description Check Cluster Status
    check_command       check_nrpe!check_cluster_status
    check_interval      60 ; Check every 60 minutes with the default interval_length of 60 seconds (use 1 for every minute)
    retry_interval      10 ; Retry every 10 minutes if the check fails
    notification_interval 120 ; Send a notification every 2 hours
    contact_groups      admins
}
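
Nagios reads `check_interval`, `retry_interval` and `notification_interval` in time units of `interval_length`, which is 60 seconds unless nagios.cfg overrides it. The sketch below works out the resulting wall-clock intervals for the values used above; the helper name `interval_seconds` is hypothetical.

```shell
#!/bin/bash
# Hypothetical helper: convert a Nagios interval (in time units) to seconds.
# The second argument is interval_length; 60 is the Nagios default.
interval_seconds() {
  local units="$1"
  local interval_length="${2:-60}"
  echo $(( units * interval_length ))
}

interval_seconds 60    # check_interval 60        -> 3600 seconds
interval_seconds 10    # retry_interval 10        -> 600 seconds
interval_seconds 120   # notification_interval 120 -> 7200 seconds
interval_seconds 1     # a value of 1             -> 60 seconds
```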

Restart the nagios service on the server.
Validate that the check reports the actual health of the cluster.
Optionally stop a resource and confirm that the check reflects the change. On production systems, consider adding a temporary virtual IP resource for this test; it can be removed after testing.
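
For validation it can help to run the plugin by hand on the cluster host and read its exit code the way nagios does. The following is a sketch only: the helper name `nagios_state` is hypothetical, and the manual run is skipped if the plugin is not present at the path used on this page.

```shell
#!/bin/bash
# Hypothetical helper: translate a plugin exit code into the Nagios state name.
nagios_state() {
  case "$1" in
    0) echo "OK" ;;
    1) echo "WARNING" ;;
    2) echo "CRITICAL" ;;
    3) echo "UNKNOWN" ;;
    *) echo "OUT OF RANGE" ;;
  esac
}

# Run the plugin manually if it exists and is executable.
plugin=/usr/lib64/nagios/plugins/cluster_check.sh
if [[ -x "$plugin" ]]; then
  "$plugin"; rc=$?
  echo "Nagios state: $(nagios_state "$rc")"
else
  echo "Plugin not found at $plugin; skipping manual run"
fi
```

Running this once before and once after stopping the test resource shows whether the state changes from OK to CRITICAL as expected.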
