Alert Manager

Installation and Configuration

AlertManager is part of the Prometheus distribution and should already be installed on your system once you have completed the installation process; see more details here.

Now you just need to enable and start the AlertManager service:

# systemctl enable --now opennebula-alertmanager.service

The configuration file for AlertManager can be found in /etc/one/alertmanager/alertmanager.yml. By default, the only receiver configured is a webhook listening on a local port. AlertManager supports several ways of delivering alert notifications. Please refer to the Prometheus documentation on Alerting Configuration to set up your own receiver.

route:
  group_by: ['alertname']
  group_wait: 30s
  group_interval: 5m
  repeat_interval: 1h
  receiver: 'web.hook'
receivers:
  - name: 'web.hook'
    webhook_configs:
      - url: 'http://127.0.0.1:5001/'
inhibit_rules:
  - source_match:
      severity: 'critical'
    target_match:
      severity: 'warning'
    equal: ['alertname', 'dev', 'instance']
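
To replace the default webhook with your own receiver, the same file can declare, for example, an email receiver. The following is only a sketch: the receiver name, addresses, SMTP host, and credentials are placeholders (assumptions, not OpenNebula defaults), and the exact options for your notification channel should be taken from the Prometheus documentation on Alerting Configuration.

```yaml
# Sketch only: every value marked as a placeholder is an assumption,
# not something shipped with OpenNebula.
route:
  group_by: ['alertname']
  group_wait: 30s
  group_interval: 5m
  repeat_interval: 1h
  receiver: 'ops-email'                    # hypothetical name; must match an entry under receivers
receivers:
  - name: 'ops-email'
    email_configs:
      - to: 'oncall@example.com'           # placeholder recipient
        from: 'alertmanager@example.com'   # placeholder sender
        smarthost: 'smtp.example.com:587'  # placeholder SMTP relay (host:port)
        auth_username: 'alertmanager'      # placeholder credentials
        auth_password: 'changeme'
```

After editing the file, restart the opennebula-alertmanager service so that the new configuration is loaded.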

Alert Rules

We provide some pre-defined alert rules that cover the most common use cases for an OpenNebula cloud. These rules are not intended to be used as-is, but as a starting point for defining the alert conditions for your specific use case. Please review the Prometheus documentation to adapt the provided alert rules.

Alert Rules can be found in /etc/one/prometheus/rules.yml.
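
Each entry in a Prometheus rules file combines an expression, a hold time, and labels. As a rough illustration, the InstanceDown rule listed below might be written as follows. The annotation text is an assumption made for this example; check the shipped file for the actual definition.

```yaml
groups:
  - name: AllInstances
    rules:
      - alert: InstanceDown
        expr: up == 0            # condition taken from the rule listing
        for: 30s                 # the condition must hold for 30s before the alert fires
        labels:
          severity: critical
        annotations:
          # illustrative text only; the shipped rule may use different annotations
          summary: "Instance {{ $labels.instance }} is down"
```

Adapting a rule usually means changing the threshold in expr, the duration in for, or the severity label.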

Group: AllInstances

Name                  Severity  Description
InstanceDown          critical  Server is down for more than 30s
                                up == 0
DiskFree              warning   Server has less than 10% of free space in rootfs
                                (node_filesystem_avail_bytes{mountpoint="/",fstype!="rootfs"} / ...) <= 10
FreeMemory10          warning   Server has less than 10% of free memory
                                ((node_memory_MemFree_bytes * 100) / node_memory_MemTotal_bytes) <= 10
LoadAverage15         warning   Server has more than 90% of load average in the last 15 minutes
                                node_load15 > 90
RebootInLast5Minutes            Server has been rebooted in the last 5 minutes
                                rate(node_boot_time_seconds[5m]) > 0

Group: OpenNebulaHosts

Name                  Severity  Description
HostDown              critical  OpenNebula Host is down for more than 30s
                                opennebula_host_state != 2
LibvirtDown           critical  Libvirt daemon on host has been down for more than 30 seconds
                                opennebula_libvirt_daemon_up == 0

Group: OpenNebulaVirtualMachines

Name                  Severity  Description
VMFailed              critical  OpenNebula VMs in failed state for more than 30 seconds
                                count(opennebula_vm_lcm_state == 44 or ...) > 0
VMPending             critical  OpenNebula VMs in pending state for more than 300 seconds
                                count(opennebula_vm_state == 1) > 0

Group: OpenNebulaServices

Name                  Severity  Description
OnedDown              critical  OpenNebula oned service is down for more than 30s
                                opennebula_oned_state == 0
SchedulerDown         critical  OpenNebula scheduler service is down for more than 30s
                                opennebula_scheduler_state == 0
HookManagerDown       critical  OpenNebula hook manager service is down for more than 30s
                                opennebula_hem_state == 0

Setting up Alarms for OpenNebula in HA

In an HA deployment, first bring up your AlertManager instances, then point all your Prometheus instances to them.

Please refer to the Using Prometheus with OpenNebula in HA section for details.
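
Pointing a Prometheus instance at several AlertManager instances is done in the alerting section of the Prometheus configuration file. A minimal sketch follows, assuming three hypothetical hostnames and AlertManager's default port 9093; the HA section referenced above remains the authoritative procedure.

```yaml
# Fragment of the Prometheus configuration file (sketch only).
# The hostnames are placeholders; replace them with your own AlertManager endpoints.
alerting:
  alertmanagers:
    - static_configs:
        - targets:
            - 'frontend1.example.com:9093'
            - 'frontend2.example.com:9093'
            - 'frontend3.example.com:9093'
```

The same targets list would be repeated on every Prometheus instance so that each one sends its alerts to all AlertManager instances.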