Check cluster health via nagios plugin

Home > Suse > SAP setup and maintenance > Check cluster health via nagios plugin

We can monitor cluster health with a nagios plugin using the steps below. Note that this has not been tested in production.

  1. Refer to Configuring nrpe based internal service checks to see how nrpe based internal checks work for remote systems.
  2. On the cluster host, create a plugin to be called via nrpe at '/usr/lib64/nagios/plugins/cluster_check.sh' (it must be executable by the nrpe user) with:
    #!/bin/bash
    
    # Run crm status command and capture output
    crm_output=$(crm status 2>&1)
    
    # Check for "error" or "warning" in the output, ignoring case
    if grep -qiE 'error|warning' <<<"$crm_output"; then
      # Include the hostname and first IP address in the plugin output
      hostname=$(hostname)
      ip=$(hostname -I | awk '{print $1}')
      echo "Cluster status is not healthy on $hostname ($ip)!"
      exit 2 # Nagios exit code for critical
    fi
    
    # Check if all nodes are online. Nodes that are down typically appear
    # on an "OFFLINE:" line or are marked "UNCLEAN" in the crm status output.
    if [[ "$crm_output" =~ OFFLINE|UNCLEAN ]]; then
      # Include the hostname and first IP address in the plugin output
      hostname=$(hostname)
      ip=$(hostname -I | awk '{print $1}')
      echo "Not all nodes are online on $hostname ($ip)!"
      exit 2 # Nagios exit code for critical
    fi
    
    # Check if all resources are started. Resources that are not running
    # typically appear as "Stopped" or "FAILED" in the crm status output.
    if [[ "$crm_output" =~ Stopped|FAILED ]]; then
      # Include the hostname and first IP address in the plugin output
      hostname=$(hostname)
      ip=$(hostname -I | awk '{print $1}')
      echo "Not all resources are started on $hostname ($ip)!"
      exit 2 # Nagios exit code for critical
    fi
    
    echo "Cluster status is healthy!"
    exit 0 # Nagios exit code for OK
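  3. Edit the nrpe configuration file (given here as '/etc/nagios/nrpe.conf'; the file is commonly named 'nrpe.cfg' and its location varies by distribution) to add the line below: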
    command[check_cluster_status]=/usr/lib64/nagios/plugins/cluster_check.sh
  4. Restart nrpe on the cluster machine
  5. On the nagios server, configure a remote service check for the appropriate host that calls the above plugin, using the nagios configuration below:
    define host {
        use           linux-server
        host_name     example-host
        alias         Example Host
        address       192.0.2.100
    }
    
    define service {
        use                 generic-service
        host_name           example-host
        service_description Check Cluster Status
        check_command       check_nrpe!check_cluster_status
        check_interval      60 ; Check every 60 minutes (units of interval_length, 60 seconds by default); use 1 to check every minute
        retry_interval      10 ; Retry every 10 minutes if the check fails
        notification_interval 120 ; Send a notification every 2 hours
        contact_groups      admins
    }
  6. Restart the nagios service on the nagios server
  7. Validate that the service status shown in nagios matches the actual cluster health
  8. Optionally stop a resource and check whether the new status is reflected in nagios. On production systems, consider adding a temporary virtual IP resource for this test instead of stopping a real resource. The virtual IP can be removed after testing.
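
Before stopping a real resource, the matching logic can also be tried against canned text. The following is a minimal sketch, not the plugin itself: the function name evaluate_crm_status is hypothetical, the sample text is hand-written for illustration rather than captured from a real cluster, and the keywords it looks for (OFFLINE, UNCLEAN, Stopped, FAILED) are assumptions about what crm status prints, which may differ between Pacemaker versions.

```shell
#!/bin/bash
# Hypothetical helper: classify crm-status-like text read from stdin.
# Prints a one-line message and returns a Nagios-style code (0 = OK, 2 = CRITICAL).
evaluate_crm_status() {
  local text
  text=$(cat)
  # Any mention of "error" or "warning", ignoring case
  if printf '%s\n' "$text" | grep -qiE 'error|warning'; then
    echo "CRITICAL: error or warning reported"
    return 2
  fi
  # Assumed markers for nodes that are down
  if printf '%s\n' "$text" | grep -qE 'OFFLINE|UNCLEAN'; then
    echo "CRITICAL: not all nodes are online"
    return 2
  fi
  # Assumed markers for resources that are not running
  if printf '%s\n' "$text" | grep -qE 'Stopped|FAILED'; then
    echo "CRITICAL: not all resources are started"
    return 2
  fi
  echo "OK: cluster status is healthy"
  return 0
}

# Hand-written sample text (illustrative only, not real cluster output)
healthy_sample='Online: [ node1 node2 ]
Full list of resources:
 vip1 (ocf::heartbeat:IPaddr2): Started node1'

degraded_sample='Online: [ node1 ]
OFFLINE: [ node2 ]
Full list of resources:
 vip1 (ocf::heartbeat:IPaddr2): Stopped'

# Run both samples and show the exit code that nagios would see
for sample in "$healthy_sample" "$degraded_sample"; do
  rc=0
  printf '%s\n' "$sample" | evaluate_crm_status || rc=$?
  echo "exit code: $rc"
done
```

To check the full path through nrpe, the command can also be run by hand from the nagios server, for example with check_nrpe -H 192.0.2.100 -c check_cluster_status (the address is the example host used above), followed by echo $? to see the exit code.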

