action #42677

closed

Keep virtual machines running on demand/failure (was: Don't just power off virtual machines upon job timeout)

Added by MMoese over 5 years ago. Updated over 4 years ago.

Status: Rejected
Priority: Normal
Assignee:
Category: Feature requests
Target version: -
Start date: 2018-10-18
Due date:
% Done: 0%
Estimated time:

Description

Motivation

When testing the kernel, soft lockups can occur, for example:
[ 140.140055] NMI watchdog: BUG: soft lockup - CPU#0 stuck for 22s! [kworker/u2:6:2060]
[ 154.592130] BUG: workqueue lockup - pool cpus=0 flags=0x4 nice=0 stuck for 40s!
[ 200.140070] NMI watchdog: BUG: soft lockup - CPU#0 stuck for 22s! [kworker/u2:6:2060]

Reproducing these lockups is difficult, and so is gathering log files, because the machine hangs.
To make openQA usable for kernel testing and for investigating the cause of such errors,
it must be possible to attach a debugger to the VM in this case.
For kernel tests, instead of shutting down QEMU, the state of the machine must therefore be
preserved when the job does not finish before the timeout. Otherwise it is not
possible to identify the cause of the lockup.

Acceptance criteria

  • AC1: On demand, an openQA-triggered VM is not powered off when a (critical) failure occurs
  • AC2: No VM is kept around indefinitely, i.e. it is still aborted and shut down after a timeout

Suggestions

  • Add a flag in os-autoinst so that the VM is not shut down immediately when the job fails but keeps running
  • Cross-check that the job and the VM are still terminated if MAX_JOB_TIME is exceeded in openQA
  • Add documentation with hints on how to use this mode, e.g. combining the new flag with MAX_JOB_TIME to keep machines around on demand for a limited time
  • Evaluate a potential follow-up to keep machines around longer, e.g. automatically keep failed machines running for a limited time, or use live migration or memory dumps to "pause" a machine in a state it can be restored from on demand
  • Optional: Add more straightforward access to this feature, e.g. GUI support
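
The interplay between the proposed flag and MAX_JOB_TIME can be sketched roughly as follows. This is only an illustration of the decision logic implied by AC1 and AC2, not os-autoinst or openQA code: `JobState`, `keep_vm_on_failure` and `should_power_off` are hypothetical names, and `max_job_time_s` merely stands in for the MAX_JOB_TIME setting.

```python
from dataclasses import dataclass


@dataclass
class JobState:
    """Hypothetical snapshot of a job, for illustration only."""
    failed: bool              # the job hit a (critical) failure
    elapsed_s: float          # seconds since the job started
    max_job_time_s: float     # stands in for openQA's MAX_JOB_TIME
    keep_vm_on_failure: bool  # hypothetical opt-in flag from the first suggestion


def should_power_off(state: JobState) -> bool:
    """Return True if the VM should be shut down now."""
    # AC2: the timeout always wins, so no VM is kept around indefinitely.
    if state.elapsed_s >= state.max_job_time_s:
        return True
    # AC1: on demand, a failed job keeps its VM alive for debugging.
    if state.failed and state.keep_vm_on_failure:
        return False
    # Default behaviour: power off as soon as the job has failed;
    # a job that is still running within its time budget is left alone.
    return state.failed
```

With the flag set, a failed job would keep its VM until the time budget runs out, which leaves a window for attaching a debugger; without the flag, the behaviour stays as it is today.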

Related issues 3 (2 open, 1 closed)

Related to openQA Project - action #36442: Access to running SUTs for System Developers (Resolved, mkittler, 2018-05-23)
Related to openQA Project - coordination #40058: [EPIC] Store VM state when reusing published image (New, 2018-08-21)
Is duplicate of openQA Project - action #12836: preserve disk image / virtual machine / keep them running in case of failures on demand (Workable, 2016-07-24)
