Alert Manager
Installation and Configuration
Note
If you are already running the Prometheus AlertManager you can skip this section and add the alarms described in the next section to your rules file.
AlertManager is part of the Prometheus distribution and should already be installed on your system once you have completed the installation process (see the installation documentation for more details).
Now you just need to enable and start the AlertManager service:
systemctl enable --now opennebula-alertmanager.service
The configuration file for the AlertManager can be found in /etc/one/alertmanager/alertmanager.yml. By default, the only receiver configured is a webhook listening on a local port. AlertManager supports several ways of sending alert notifications; please refer to the Prometheus documentation on Alerting Configuration to set up your own receiver.
route:
  group_by: ['alertname']
  group_wait: 30s
  group_interval: 5m
  repeat_interval: 1h
  receiver: 'web.hook'
receivers:
  - name: 'web.hook'
    webhook_configs:
      - url: 'http://127.0.0.1:5001/'
inhibit_rules:
  - source_match:
      severity: 'critical'
    target_match:
      severity: 'warning'
    equal: ['alertname', 'dev', 'instance']
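As an illustration of replacing the default webhook, the following is a minimal sketch of a route that delivers notifications by email. The receiver name `ops-email`, the addresses, the SMTP host, and the credentials are all placeholders assumed for this example; the `email_configs` keys are standard Alertmanager settings, but check the Alerting Configuration reference for the options supported by your version.

```yaml
# Sketch only: every address, hostname and credential here is a placeholder.
route:
  group_by: ['alertname']
  group_wait: 30s
  group_interval: 5m
  repeat_interval: 1h
  receiver: 'ops-email'            # hypothetical receiver name
receivers:
  - name: 'ops-email'
    email_configs:
      - to: 'ops@example.com'                    # where notifications are sent
        from: 'alertmanager@example.com'         # sender address
        smarthost: 'smtp.example.com:587'        # SMTP server as host:port
        auth_username: 'alertmanager@example.com'
        auth_password: 'changeme'
        send_resolved: true                      # also notify when the alert clears
```

After editing the file, the opennebula-alertmanager service most likely needs to be restarted or reloaded for the new receiver to take effect.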
Alert Rules
We provide some pre-defined alert rules that cover the most common use cases for an OpenNebula cloud. These rules are not intended to be used as-is, but as a starting point for defining the alert conditions for your specific use case. Please review the Prometheus documentation to adapt the provided alert rules.
The alert rules can be found in /etc/one/prometheus/rules.yml.
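To make the structure of a rule concrete before adapting the provided ones, here is a minimal sketch of a rule group in the standard Prometheus rules format. It is not a copy of the rule shipped in rules.yml: the expression and annotations are assumptions written for this example, using the built-in `up` metric that Prometheus records for every scrape target.

```yaml
# Sketch of a Prometheus rules file; the shipped rules may differ.
groups:
  - name: AllInstances
    rules:
      - alert: InstanceDown
        expr: up == 0              # 'up' is 0 when the last scrape of a target failed
        for: 30s                   # condition must hold this long before the alert fires
        labels:
          severity: critical       # matched by routes and inhibit rules in AlertManager
        annotations:
          summary: 'Instance {{ $labels.instance }} is down'
          description: '{{ $labels.instance }} (job {{ $labels.job }}) has been unreachable for more than 30 seconds.'
```

Adapting a rule usually means changing the threshold in `expr`, the duration in `for`, or the `severity` label. If the `promtool` utility is available on your system, `promtool check rules /etc/one/prometheus/rules.yml` can validate the file before Prometheus reloads it.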
Group: AllInstances
Name | Severity | Description |
---|---|---|
InstanceDown | critical | Server is down for more than 30s |
DiskFree | warning | Server has less than 10% of free space in rootfs |
FreeMemory10 | warning | Server has less than 10% of free memory |
LoadAverage15 | warning | Server has more than 90% of load average in the last 15 minutes |
RebootInLast5Minutes | | Server has been rebooted in the last 5 minutes |
Group: OpenNebulaHosts
Name | Severity | Description |
---|---|---|
HostDown | critical | OpenNebula Host is down for more than 30s |
LibvirtDown | critical | Libvirt daemon on host has been down for more than 30 seconds |
Group: OpenNebulaVirtualMachines
Name | Severity | Description |
---|---|---|
VMFailed | critical | OpenNebula VMs in failed state more than 30 seconds |
VMPending | critical | OpenNebula VMs in pending state more than 300 seconds |
Group: OpenNebulaServices
Name | Severity | Description |
---|---|---|
OnedDown | critical | OpenNebula oned service is down for more than 30s |
SchedulerDown | critical | OpenNebula scheduler service is down for more than 30s |
HookManagerDown | critical | OpenNebula hook manager service is down for more than 30s |
Setting up Alarms for OpenNebula in HA
Important
To avoid duplicate alert notifications, configure all your AlertManager instances to run in HA mode, then point every Prometheus instance to all of them.
Please refer to the Using Prometheus with OpenNebula in HA section for details.
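As a rough sketch of the Prometheus side of this setup, the fragment below lists several AlertManager instances in the `alerting` section of prometheus.yml. The hostnames are placeholders assumed for this example, and 9093 is the default AlertManager port, which may differ in your deployment. Clustering the AlertManager instances themselves is configured separately and is covered in the section referenced above.

```yaml
# Fragment of prometheus.yml; hostnames are placeholders.
alerting:
  alertmanagers:
    - static_configs:
        - targets:
            # List every AlertManager instance explicitly, not a load balancer
            # in front of them, so that each one receives every alert.
            - 'alertmanager-1.example.com:9093'
            - 'alertmanager-2.example.com:9093'
            - 'alertmanager-3.example.com:9093'
```

The same target list would be repeated on each Prometheus instance, so that the clustered AlertManagers can deduplicate the notifications among themselves.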