action #42677: Keep virtual machines running on demand/failure (was: Don't just power off virtual machines upon job timeout) - openQA Project (public) - openSUSE Project Management Tool

action #42677

closed

Keep virtual machines running on demand/failure (was: Don't just power off virtual machines upon job timeout)

Added by MMoese over 6 years ago. Updated over 5 years ago.

Status: Rejected
Priority: Normal
Assignee: okurz
Category: Feature requests
Start date: 2018-10-18

Description

Motivation

When testing the kernel, soft-lockups can occur, for example:
[ 140.140055] NMI watchdog: BUG: soft lockup - CPU#0 stuck for 22s! [kworker/u2:6:2060]
[ 154.592130] BUG: workqueue lockup - pool cpus=0 flags=0x4 nice=0 stuck for 40s!
[ 200.140070] NMI watchdog: BUG: soft lockup - CPU#0 stuck for 22s! [kworker/u2:6:2060]

Reproducing these is quite difficult, as is gathering log files, because the machine hangs.
To make openQA usable for kernel testing and for investigating the cause of such errors,
it must be possible to attach a debugger to the VM in this case.
So for kernel tests, instead of shutting down QEMU, the state of the machine must be
preserved if the job does not finish before the timeout; otherwise it will not
be possible to identify the cause of the lockup.
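
As an illustration of how such lockups could be spotted in a captured serial log, here is a minimal sketch. The function name `find_lockups` and the regular expression are assumptions made for this example and are not part of openQA or os-autoinst; the pattern is derived only from the three kernel messages quoted above and may miss other variants.

```python
import re

# Pattern derived from the kernel messages quoted in this ticket (assumption:
# other kernels may word these messages differently).
LOCKUP_RE = re.compile(
    r"^\[\s*(?P<ts>\d+\.\d+)\]\s+(?:NMI watchdog: )?BUG: "
    r"(?P<kind>soft lockup|workqueue lockup) - .*stuck for (?P<secs>\d+)s"
)


def find_lockups(serial_log: str):
    """Return (timestamp, kind, stuck_seconds) for every lockup line found."""
    hits = []
    for line in serial_log.splitlines():
        match = LOCKUP_RE.match(line.strip())
        if match:
            hits.append((float(match["ts"]), match["kind"], int(match["secs"])))
    return hits
```

A hit from such a check only reports that a lockup happened; it does not replace keeping the VM state for a debugger.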

Acceptance criteria

AC1: On demand, an openQA-triggered VM is not powered off when a (critical) failure occurs
AC2: No VM is kept around indefinitely, e.g. it is still aborted and shut down after a timeout

Suggestions

Add a flag in os-autoinst to not shut down immediately when the job fails but keep the VM running
Crosscheck that if MAX_JOB_TIME is exceeded in openQA the job and VM are still terminated
Add documentation with hints on how to use this mode, e.g. using the new flag and MAX_JOB_TIME in combination to keep machines around on demand for a limited time
Evaluate a potential follow-up to keep machines around longer, e.g. keep failed machines running for a limited time automatically, or use live migration or memory dumps to "pause" a machine so that it can be restored on demand
Optional: Add more straightforward access to this feature, e.g. GUI support
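
The interplay between AC1, AC2 and the first two suggestions can be sketched as a small calculation. This is only an illustration under stated assumptions: `hold_seconds` and its parameters are hypothetical names, not an existing os-autoinst or openQA API; only MAX_JOB_TIME comes from the ticket.

```python
def hold_seconds(elapsed_s: float, max_job_time_s: float, requested_hold_s: float) -> float:
    """Return how long a failed job's VM may stay up.

    The requested hold time is honoured (AC1) but is capped by the time left
    before MAX_JOB_TIME is reached, so the VM is never kept forever (AC2).
    """
    remaining = max(0.0, max_job_time_s - elapsed_s)
    return min(max(0.0, requested_hold_s), remaining)
```

For example, a job that fails after 6000 seconds with a MAX_JOB_TIME of 7200 seconds and a requested hold of one hour would keep its VM for at most the remaining 1200 seconds.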

Related issues 3 (2 open — 1 closed)

Updated by coolo over 6 years ago

Subject changed from Don't just power off virtual machines upon job timeout to [epic] Don't just power off virtual machines upon job timeout
Category changed from Regressions/Crashes to Feature requests
Priority changed from High to Normal

Not a bug - not by far

Updated by coolo over 6 years ago

Relying on the job timeout as a test metric sounds pretty wild to me. Time expectations belong in test modules, and post_fail_hooks are there to write machine state.

Updated by MMoese over 6 years ago

This has nothing to do with time expectations of a test. But at some point, there is a timeout that ends the job.
And if the kernel locked up, there's no chance to collect system state by running some commands.

Updated by morbidrsa over 6 years ago

Is it possible to enter qemu monitor in a post_fail_hook to write out the vmstate (i.e. do a live migration to a local file), so we can have something to debug hard- or soft-lockups, or even kernel panics?

Updated by okurz over 6 years ago

morbidrsa wrote:

Is it possible to enter qemu monitor in a post_fail_hook to write out the vmstate (i.e. do a live migration to a local file), so we can have something to debug hard- or soft-lockups, or even kernel panics?

Yes, both should be possible in theory. The "live migration" is possible using save_memory_dump.
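
To make the QEMU monitor idea more concrete, here is a minimal sketch that builds the QMP messages one could send to pause a guest and dump its memory to a local file. It assumes QEMU was started with a QMP socket and says nothing about how save_memory_dump is implemented in os-autoinst; the helper name `qmp_dump_commands` is hypothetical. The sketch only constructs the JSON messages and does not open a connection.

```python
import json


def qmp_dump_commands(dump_path: str, pause_first: bool = True):
    """Build the QMP messages for pausing the guest and dumping its memory."""
    # Every QMP session has to start with capability negotiation.
    commands = [{"execute": "qmp_capabilities"}]
    if pause_first:
        # Freeze the guest so the dump reflects a consistent state.
        commands.append({"execute": "stop"})
    commands.append({
        "execute": "dump-guest-memory",
        "arguments": {"paging": False, "protocol": f"file:{dump_path}"},
    })
    return [json.dumps(command) for command in commands]
```

Each returned string would be written to the QMP socket followed by a newline, with the reply read before the next message is sent. The target path is chosen by the caller.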

Updated by okurz over 6 years ago

Related to action #36442: Access to running SUTs for System Developers added

Updated by okurz over 5 years ago

Related to coordination #40058: [EPIC] Store VM state when reusing published image added

Updated by okurz over 5 years ago

Subject changed from [epic] Don't just power off virtual machines upon job timeout to Keep virtual machines running on demand/failure (was: Don't just power off virtual machines upon job timeout)
Description updated (diff)
Status changed from New to Workable

Rewritten using the feature request template with my understanding of what is feasible to achieve and suggestions.

Updated by okurz over 5 years ago

Status changed from Workable to Rejected
Assignee set to okurz

Merged with #12836

Updated by okurz over 5 years ago

Is duplicate of action #12836: preserve disk image / virtual machine / keep them running in case of failures on demand added
