Troubleshooting¶
Logging¶
Every OpenNebula server generates logs with a configurable verbosity (level of detail) and through different means (file, syslog, or standard error output) to allow cloud administrators to troubleshoot potential problems. Logs are stored in /var/log/one/ on the Front-end Host running the particular component. Some valuable error messages can also be seen by end-users in the CLI tools or the Sunstone GUI.
Configure Logging System¶
Follow the guide of each component to find the location of its logs and the configuration of log verbosity:
- OpenNebula Daemon: logs, configuration (parameter LOG/DEBUG_LEVEL)
- Scheduler: logs, configuration (parameter LOG/DEBUG_LEVEL)
- Monitoring: logs, configuration (parameter LOG/DEBUG_LEVEL)
- FireEdge: logs, configuration (parameter log)
- OneFlow: logs, configuration (parameter :debug_level)
- OneGate: logs, configuration (parameter :debug_level)
After changing the logging level, don’t forget to restart the service so that the change takes effect.
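For example, on the Front-end you could raise the OpenNebula Daemon verbosity to debug and restart it as sketched below (the configuration path and service name are the common defaults and may differ in your installation):

# Set DEBUG_LEVEL = 3 (debug) in the LOG section of /etc/one/oned.conf
sudo sed -i 's/DEBUG_LEVEL *= *[0-9]/DEBUG_LEVEL = 3/' /etc/one/oned.conf
# Restart the daemon so the new logging level takes effect
sudo systemctl restart opennebula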
Important
Logs are rotated on (re)start of a particular component. Historic logs are kept alongside the current logs with date/time suffixes (e.g., the latest /var/log/one/oned.log might have the historic log /var/log/one/oned.log-20210321-1616319097 next to it, or an even older compressed log /var/log/one/oned.log-20210314-1615719402.gz).
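For example, to list the current daemon log together with its rotated copies:

ls -l /var/log/one/oned.log*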
Additional Resources¶
In addition to the common service logs, the following are other places to investigate when troubleshooting problems:
- Virtual Machines: The information specific to a VM will be dumped in the log file /var/log/one/<vmid>.log. All VMs controlled by OpenNebula have their own directory, /var/lib/one/vms/<VID>, if syslog/stderr isn’t enabled. You can find the following information in it:
  - Deployment description files: stored in deployment.<EXECUTION>, where <EXECUTION> is the sequence number in the execution history of the VM (deployment.0 for the first host, deployment.1 for the second, and so on).
  - Transfer description files: stored in transfer.<EXECUTION>.<OPERATION>, where <EXECUTION> is the sequence number in the execution history of the VM, and <OPERATION> is the stage where the script was used, e.g. transfer.0.prolog, transfer.0.epilog, or transfer.1.cleanup.
- Drivers: Each driver can have its ONE_MAD_DEBUG variable activated in its RC file. If enabled, the error information will be dumped in /var/log/one/name-of-the-driver-executable.log. Log information from the drivers is in oned.log. An example of enabling this is sketched after this list.
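As a sketch of the Drivers item above (assuming the shared driver RC file lives at its common default location; your installation and OpenNebula version may differ), driver debugging could be enabled like this:

# Enable verbose driver logging in the shared driver RC file (assumed default path)
echo 'ONE_MAD_DEBUG=1' | sudo tee -a /etc/one/defaultrc
# Restart OpenNebula so running drivers pick up the new setting
sudo systemctl restart opennebula
# Per-driver logs then appear under /var/log/one/, e.g. one_vmm_exec.log for the VMM exec driver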
OpenNebula Daemon Log Format¶
The structure of OpenNebula Daemon log messages for a file-based logging system is the following:
date [Z<zone_id>][module][log_level]: message body
In the case of syslog it follows the standard:
date hostname process[pid]: [Z<zone_id>][module][log_level]: message
where the zone_id is the ID of the Zone in the federation (0 for single Zone setups), the module is any of the internal OpenNebula components (VMM, ReM, TM, etc.), and the log_level is a single character indicating the log level (I for informational, D for debugging, etc.).
For syslog, OpenNebula will also log the Virtual Machine events like this:
date hostname process[pid]: [VM id][Z<zone_id>][module][log_level]: message
and similarly for stderr logging.
For oned and VM events the formats are:
date [Z<zone_id>][module][log_level]: message
date [VM id][Z<zone_id>][module][log_level]: message
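As an illustration of this format, an informational message from the virtualization driver module on a single-Zone Front-end might look like the following made-up example line in /var/log/one/oned.log:

Mon Mar 22 10:15:01 2021 [Z0][VMM][I]: Successfully execute virtualization driver operation: deploy.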
Infrastructure Failures¶
Virtual Machines¶
The causes of Virtual Machine errors can be found in the details of the VM. Any VM owner or cloud administrator can see the error via the onevm show $ID command (or in the Sunstone GUI). For example:
onevm show 0
VIRTUAL MACHINE 0 INFORMATION
ID : 0
NAME : one-0
USER : oneadmin
GROUP : oneadmin
STATE : ACTIVE
LCM_STATE : PROLOG_FAILED
START TIME : 07/19 17:44:20
END TIME : 07/19 17:44:31
DEPLOY ID : -
VIRTUAL MACHINE MONITORING
NET_TX : 0
NET_RX : 0
USED MEMORY : 0
USED CPU : 0
VIRTUAL MACHINE TEMPLATE
CONTEXT=[
FILES=/tmp/some_file,
TARGET=hdb ]
CPU=0.1
ERROR=[
MESSAGE="Error executing image transfer script: Error copying /tmp/some_file to /var/lib/one/0/images/isofiles",
TIMESTAMP="Tue Jul 19 17:44:31 2011" ]
MEMORY=64
NAME=one-0
VMID=0
VIRTUAL MACHINE HISTORY
SEQ HOSTNAME ACTION START TIME PTIME
0 host01 none 07/19 17:44:31 00 00:00:00 00 00:00:00
The error message here (see ERROR=[MESSAGE="Error executing image...) shows an error when copying an image (the file /tmp/some_file). The source file most likely doesn’t exist. Alternatively, you can check the detailed log of a particular VM in /var/log/one/$ID.log (in this case the VM has ID 0 and the log file would be /var/log/one/0.log).
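To quickly locate the failing step, you can for instance filter the VM log for error lines with standard tools (a simple sketch using the VM with ID 0 from above):

grep -iE 'error|fail' /var/log/one/0.log | tail -n 20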
Recover from VM Failure¶
The overall state of a virtual machine in a failure condition will show as failure (or fail in the CLI). To find out the specific failure situation you need to check the LCM_STATE of the VM in the VM info tab (or with onevm show in the CLI). Moreover, a VM can be stuck in a transition (e.g. boot or save) because of a host or network failure. Typically these operations will eventually time out and lead to a VM failure state.
The administrator can force a recovery action from Sunstone or from the CLI with the onevm recover command. This command has the following options (an example invocation is shown after the list):
- --success: If the operation has been confirmed to succeed. For example, the administrator can see the VM properly running in the hypervisor, but the driver failed to inform OpenNebula of the successful boot.
- --failure: This will have the same effect as a driver reporting a failure. It is intended for VMs that get stuck in transient states. As an example, if a storage problem occurs and the administrator knows that a VM stuck in prolog is not going to finish the pending transfer, this action will manually move the VM to prolog_failure.
- --retry: To retry the previously failed action. It can be used, for instance, if a VM is in boot_failure because the hypervisor crashed. The administrator can tell OpenNebula to retry the boot after the hypervisor is started again.
- --retry --interactive: In some scenarios where the failure was caused by an error in the Transfer Manager actions, each action can be rerun and debugged until it works. Once the commands are successful, a success should be sent. See the specific section below for more details.
- --delete: No recovery action possible; delete the VM. This is equivalent to the deprecated OpenNebula < 5.0 command: onevm delete.
- --delete-db: No recovery action possible; delete the VM from the DB. It does not trigger any action on the hypervisor.
- --recreate: No recovery action possible; delete and recreate the VM. This is equivalent to the deprecated OpenNebula < 5.0 command: onevm delete --recreate.
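For example, for a hypothetical VM with ID 42 that is stuck in a failed state, the recovery could look like this:

onevm recover 42 --retry      # re-run the previously failed action
onevm recover 42 --success    # or confirm the operation actually succeeded in the hypervisor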
Note also that OpenNebula will try to automatically recover from some failure situations using the monitor information. A specific example is that a VM in the boot_failure state will become running if the monitoring reports that the VM was found running in the hypervisor.
Hypervisor Problems¶
The following list details failure states caused by errors related to the hypervisor.
- BOOT_FAILURE: The VM failed to boot but all the files needed by the VM are already in the Host. Check the hypervisor logs to find out the problem and, once fixed, recover the VM with the retry option (see the sketch after this list).
- BOOT_MIGRATE_FAILURE: same as above but during a migration. Check the target hypervisor and retry the operation.
- BOOT_UNDEPLOY_FAILURE: same as above but during a resume after an undeploy. Check the target hypervisor and retry the operation.
- BOOT_STOPPED_FAILURE: same as above but during a resume after a stop. Check the target hypervisor and retry the operation.
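On a KVM Host, a minimal way to check the hypervisor side before retrying could look like this (assuming libvirt/KVM with its default per-domain log location; libvirt may instead log to the system journal, and the VM ID 42 is only an example):

# On the failing Host: inspect the per-domain log (OpenNebula names domains one-<vmid>)
sudo tail -n 50 /var/log/libvirt/qemu/one-42.log
# Back on the Front-end, once the problem is fixed, retry the boot
onevm recover 42 --retry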
Transfer Manager / Storage Problems¶
The following list details failure states caused by errors in the Transfer Manager driver. These states can be recovered from by checking the vm.log and looking for the specific error (disk space, permissions, misconfigured datastore, etc.). You can execute --retry to relaunch the Transfer Manager actions after fixing the problem (freeing disk space, etc.). You can execute --retry --interactive to launch a Transfer Manager Interactive Debug environment that will allow you to: (1) see all the TM actions in detail, (2) relaunch each action until it’s successful, and (3) skip TM actions.
- PROLOG_FAILURE: there was a problem setting up the disk images needed by the VM.
- PROLOG_MIGRATE_FAILURE: problem setting up the disks in the target host.
- EPILOG_FAILURE: there was a problem processing the disk images (may be discard or save) after the VM execution.
- EPILOG_STOP_FAILURE: there was a problem moving the disk images after a stop.
- EPILOG_UNDEPLOY_FAILURE: there was a problem moving the disk images after an undeploy.
- PROLOG_MIGRATE_POWEROFF_FAILURE: problem restoring the disk images after a migration in a poweroff state.
- PROLOG_MIGRATE_SUSPEND_FAILURE: problem restoring the disk images after a migration in a suspend state.
- PROLOG_RESUME_FAILURE: problem restoring the disk images after a stop.
- PROLOG_UNDEPLOY_FAILURE: problem restoring the disk images after an undeploy.
Here’s an example of a Transfer Manager Interactive Debug environment (onevm recover <id> --retry --interactive):
onevm show 2|grep LCM_STATE
LCM_STATE : PROLOG_UNDEPLOY_FAILURE
onevm recover 2 --retry --interactive
TM Debug Interactive Environment.
TM Action list:
(1) MV shared haddock:/var/lib/one//datastores/0/2/disk.0 localhost:/var/lib/one//datastores/0/2/disk.0 2 1
(2) MV shared haddock:/var/lib/one//datastores/0/2 localhost:/var/lib/one//datastores/0/2 2 0
Current action (1):
MV shared haddock:/var/lib/one//datastores/0/2/disk.0 localhost:/var/lib/one//datastores/0/2/disk.0 2 1
Choose action:
(r) Run action
(n) Skip to next action
(a) Show all actions
(q) Quit
> r
LOG I Command execution fail: /var/lib/one/remotes/tm/shared/mv haddock:/var/lib/one//datastores/0/2/disk.0 localhost:/var/lib/one//datastores/0/2/disk.0 2 1
LOG I ExitCode: 1
FAILURE. Repeat command.
Current action (1):
MV shared haddock:/var/lib/one//datastores/0/2/disk.0 localhost:/var/lib/one//datastores/0/2/disk.0 2 1
Choose action:
(r) Run action
(n) Skip to next action
(a) Show all actions
(q) Quit
> # FIX THE PROBLEM...
> r
SUCCESS
Current action (2):
MV shared haddock:/var/lib/one//datastores/0/2 localhost:/var/lib/one//datastores/0/2 2 0
Choose action:
(r) Run action
(n) Skip to next action
(a) Show all actions
(q) Quit
> r
SUCCESS
If all the TM actions have been successful and you want to
recover the Virtual Machine to the RUNNING state execute this command:
onevm recover 2 --success
onevm recover 2 --success
onevm show 2|grep LCM_STATE
LCM_STATE : RUNNING
Hosts¶
Host errors can be investigated via the onehost show $ID command. For example:
onehost show 1
HOST 1 INFORMATION
ID : 1
NAME : host01
STATE : ERROR
IM_MAD : im_kvm
VM_MAD : vmm_kvm
TM_MAD : tm_shared
HOST SHARES
MAX MEM : 0
USED MEM (REAL) : 0
USED MEM (ALLOCATED) : 0
MAX CPU : 0
USED CPU (REAL) : 0
USED CPU (ALLOCATED) : 0
TOTAL VMS : 0
MONITORING INFORMATION
ERROR=[
MESSAGE="Error monitoring host 1 : MONITOR FAILURE 1 Could not update remotes",
TIMESTAMP="Tue Jul 19 17:17:22 2011" ]
The error message here (see ERROR=[MESSAGE="Error monitoring host...) shows an error when updating remote drivers on a host. To get more information, you have to check the OpenNebula Daemon log (/var/log/one/oned.log) and, for example, find this relevant error:
Tue Jul 19 17:17:22 2011 [InM][I]: Monitoring host host01 (1)
Tue Jul 19 17:17:22 2011 [InM][I]: Command execution fail: scp -r /var/lib/one/remotes/. host01:/var/tmp/one
Tue Jul 19 17:17:22 2011 [InM][I]: ssh: Could not resolve hostname host01: nodename nor servname provided, or not known
Tue Jul 19 17:17:22 2011 [InM][I]: lost connection
Tue Jul 19 17:17:22 2011 [InM][I]: ExitCode: 1
Tue Jul 19 17:17:22 2011 [InM][E]: Error monitoring host 1 : MONITOR FAILURE 1 Could not update remotes
The error message (Could not resolve hostname) shows that the hostname of the OpenNebula Host is incorrect and can’t be resolved in DNS.
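A minimal fix for this particular case could be to make the Host name resolvable from the Front-end and then push the probes again (an illustrative sketch; the IP address is made up and must match your environment):

# On the Front-end: map the Host name to its IP address (example address)
echo '192.168.1.10 host01' | sudo tee -a /etc/hosts
# As the oneadmin user, re-sync the monitoring probes and re-enable the Host
onehost sync host01 --force
onehost enable host01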