Alert Manager

Installation and Configuration

Note

If you are already running the Prometheus AlertManager, you can skip this section and add the alarms described in the next section to your rules file.

AlertManager is part of the Prometheus distribution and should already be installed on your system after completing the installation process (see more details here).

Now you just need to enable and start the AlertManager service:

systemctl enable --now opennebula-alertmanager.service

The configuration file for the AlertManager can be found in /etc/one/alertmanager/alertmanager.yml. By default, the only receiver configured is a webhook listening on a local port. AlertManager supports several notification mechanisms; please refer to the Prometheus documentation on Alerting Configuration to set up your own receiver. The default configuration is:

route:
  group_by: ['alertname']
  group_wait: 30s
  group_interval: 5m
  repeat_interval: 1h
  receiver: 'web.hook'
receivers:
  - name: 'web.hook'
    webhook_configs:
      - url: 'http://127.0.0.1:5001/'
inhibit_rules:
  - source_match:
      severity: 'critical'
    target_match:
      severity: 'warning'
    equal: ['alertname', 'dev', 'instance']
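
As an illustration of replacing the default webhook, the following is a minimal sketch of an email receiver. The receiver name, the addresses, and the SMTP host are placeholders that do not come from the OpenNebula packages, and a real SMTP server will usually also need authentication settings; check the Prometheus Alerting Configuration documentation before adapting it.

```yaml
# Sketch only: 'ops-email', the addresses and the SMTP host are hypothetical.
route:
  group_by: ['alertname']
  group_wait: 30s
  group_interval: 5m
  repeat_interval: 1h
  receiver: 'ops-email'            # must match a name under "receivers"
receivers:
  - name: 'ops-email'
    email_configs:
      - to: 'ops@example.com'                 # placeholder recipient
        from: 'alertmanager@example.com'      # placeholder sender
        smarthost: 'smtp.example.com:587'     # placeholder SMTP relay
        send_resolved: true                   # also notify when the alert clears
```

After editing the file, the AlertManager service has to be restarted or reloaded for the change to take effect.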

Alert Rules

We provide some pre-defined alert rules that cover the most common use cases for an OpenNebula cloud. These rules are not intended to be used as-is, but as a starting point for defining the alert conditions for your specific use case. Please review the Prometheus documentation to adapt the provided alert rules.

Alert rules can be found in /etc/one/prometheus/rules.yml.

Group: AllInstances

Name	Severity	Description
InstanceDown	critical	Server is down for more than 30s
	`up == 0`
DiskFree	warning	Server has less than 10% of free space in rootfs
	`(node_filesystem_avail_bytes{mountpoint="/",fstype!="rootfs"} / ...) <= 10`
FreeMemory10	warning	Server has less than 10% of free memory
	`((node_memory_MemFree_bytes * 100) / node_memory_MemTotal_bytes) <= 10`
LoadAverage15	warning	Server has more than 90% load average over the last 15 minutes
	`node_load15 > 90`
RebootInLast5Minutes		Server has been rebooted in the last 5 minutes
	`rate(node_boot_time_seconds[5m]) > 0`
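
To show how a table row maps onto the Prometheus rule file format, here is a sketch of the InstanceDown rule. The expression and severity are taken from the table; the `for: 30s` duration is inferred from the description, the group name is assumed to match the heading, and the annotation text is illustrative. The rule actually shipped in rules.yml may differ.

```yaml
groups:
  - name: AllInstances               # assumed to match the group heading
    rules:
      - alert: InstanceDown
        expr: up == 0                # expression from the table
        for: 30s                     # inferred from "more than 30s"
        labels:
          severity: critical         # severity from the table
        annotations:
          # Illustrative wording, not taken from the shipped rules file
          summary: 'Server {{ $labels.instance }} is down'
```

Adapting a rule typically means changing the threshold in `expr`, the `for` duration, or the `severity` label that the AlertManager inhibit rules match on.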

Group: OpenNebulaHosts

Name	Severity	Description
HostDown	critical	OpenNebula Host is down for more than 30s
	`opennebula_host_state != 2`
LibvirtDown	critical	Libvirt daemon on host has been down for more than 30 seconds
	`opennebula_libvirt_daemon_up == 0`

Group: OpenNebulaVirtualMachines

Name	Severity	Description
VMFailed	critical	OpenNebula VMs in failed state for more than 30 seconds
	`count(opennebula_vm_lcm_state == 44 or ...) > 0`
VMPending	critical	OpenNebula VMs in pending state for more than 300 seconds
	`count(opennebula_vm_state == 1) > 0`

Group: OpenNebulaServices

Name	Severity	Description
OnedDown	critical	OpenNebula oned service is down for more than 30s
	`opennebula_oned_state == 0`
SchedulerDown	critical	OpenNebula scheduler service is down for more than 30s
	`opennebula_scheduler_state == 0`
HookManagerDown	critical	OpenNebula hook manager service is down for more than 30s
	`opennebula_hem_state == 0`

Setting up Alarms for OpenNebula in HA

Important

To avoid duplicate alert notifications, you should configure all your AlertManager instances to run in HA mode, and then point all your Prometheus instances to every one of them.

Please refer to the Using Prometheus with OpenNebula in HA section for details.
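
As a rough sketch of the second part of that recommendation, the fragment below shows how the `alerting` section of a Prometheus configuration could list every AlertManager instance directly, rather than a single load-balanced address. The hostnames are hypothetical, and 9093 is assumed to be the AlertManager port in your deployment.

```yaml
# Fragment of the Prometheus configuration; hostnames are placeholders.
alerting:
  alertmanagers:
    - static_configs:
        - targets:
            - 'alertmanager-1.example.com:9093'   # hypothetical instance 1
            - 'alertmanager-2.example.com:9093'   # hypothetical instance 2
            - 'alertmanager-3.example.com:9093'   # hypothetical instance 3
```

Each Prometheus instance sends its alerts to all listed targets, and the clustered AlertManager instances deduplicate the notifications among themselves.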