action #12836: preserve disk image / virtual machine / keep them running in case of failures on demand - openQA Project (public) - openSUSE Project Management Tool

action #12836

open

preserve disk image / virtual machine / keep them running in case of failures on demand

Added by okurz over 8 years ago. Updated about 4 years ago.

Status: Workable
Priority: Normal
Assignee:
Category: Feature requests
Target version: QA (public) - future
Start date: 2016-07-24
Due date:
% Done:
Estimated time:

Description

Motivation

https://openqa.suse.de/tests/483548#step/yast2_i/17

or also: When testing the kernel, soft-lockups can occur, for example:
[ 140.140055] NMI watchdog: BUG: soft lockup - CPU#0 stuck for 22s! [kworker/u2:6:2060]
[ 154.592130] BUG: workqueue lockup - pool cpus=0 flags=0x4 nice=0 stuck for 40s!
[ 200.140070] NMI watchdog: BUG: soft lockup - CPU#0 stuck for 22s! [kworker/u2:6:2060]

Reproducing these is quite difficult, as is gathering log files, because the machine hangs.
To be usable for kernel testing and for investigating the cause of an error, it must be
possible to attach a debugger to the VM in this case.
So for kernel tests, instead of shutting down QEMU, the state of the machine must be
preserved if the job does not finish before the timeout; otherwise it will not
be possible to identify the cause of the lockup.

User story

As an investigator of non-obvious test failures, I want (sometimes) to preserve the virtual machine (keep it running) or the disk image so that I can investigate an error.

Acceptance criteria

AC1: Either disk images or virtual machines are preserved in case of failures, on demand or automatically
AC2: On demand, an openQA-triggered VM is not powered off when a (critical) failure appears
AC3: No VM is kept around indefinitely, e.g. it is still aborted and shut down after a timeout

Suggestions

Add a flag in os-autoinst to not shut down immediately when the job fails but keep running
Crosscheck that if MAX_JOB_TIME is exceeded in openQA the job and VM are still terminated
Add documentation with hints on how to use this mode, e.g. how to use the new flag and MAX_JOB_TIME in combination to keep machines around on demand for a limited time
Evaluate potential follow-ups to keep machines around longer, e.g. automatically keep failed machines running for a limited time, or use live migration or memory dumps to "pause" a machine in a state it can be restored from on demand
Optional: Add more straightforward access to this feature, e.g. GUI support
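
The interplay between such a keep-running flag and MAX_JOB_TIME could look roughly like the following sketch. It is not os-autoinst code: the `keep_on_failure` flag, the `grace_s` default of one hour, and all function and field names are assumptions made up for illustration. The only point is that the VM stays up on failure, but never beyond the overall job time limit.

```python
from dataclasses import dataclass


@dataclass
class JobState:
    failed: bool            # did the job end with a failure?
    elapsed_s: int          # seconds the job has been running so far
    max_job_time_s: int     # overall limit, as set by MAX_JOB_TIME
    keep_on_failure: bool   # hypothetical flag requesting that the VM stays up
    grace_s: int = 3600     # hypothetical upper bound for keeping the VM alive


def keep_alive_seconds(job: JobState) -> int:
    """Return how many seconds the VM may stay up after the job ended."""
    if not (job.failed and job.keep_on_failure):
        return 0  # normal behaviour: shut down immediately
    # Never exceed the overall job time limit, so the VM cannot stay forever.
    remaining = max(0, job.max_job_time_s - job.elapsed_s)
    return min(job.grace_s, remaining)
```

With these assumptions, a job that fails after 7000 seconds with a MAX_JOB_TIME of 7200 would keep its VM for only 200 more seconds, which is the behaviour AC3 asks for.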

Further details

It might be feasible to let virtual machines run for some time, e.g. 1 hour, in case of a failure, which can give investigators some time to debug or at least copy the image while it's running. Also, the image could be "published" and stored and cleaned according to existing or to-be-defined GRU-cleanup rules.

An alternative would be for an openQA operator to request the preservation with a variable, e.g. clone a consistently failing job and set the variable. PUBLISH_HDD_1 comes close but does not work, as it only publishes on success. Either another variable could be set to publish regardless of the result, or a separate variable could be introduced to store "failed" images, e.g. PUBLISH_FAILED_HDD_1.
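
As a rough sketch of that idea, the selection of images to publish could depend on the job outcome as shown below. PUBLISH_FAILED_HDD_1 is only the name proposed in this ticket and does not exist; the function name, the plain dictionary of settings, and the boolean `job_ok` are likewise assumptions for illustration, not how os-autoinst implements publishing.

```python
SUCCESS_PREFIX = "PUBLISH_HDD_"          # existing variable family, publishes on success
FAILURE_PREFIX = "PUBLISH_FAILED_HDD_"   # proposed in this ticket, hypothetical


def images_to_publish(settings: dict, job_ok: bool) -> dict:
    """Map disk number (as a string) to the target file name to publish."""
    prefix = SUCCESS_PREFIX if job_ok else FAILURE_PREFIX
    return {
        key[len(prefix):]: value
        for key, value in settings.items()
        if key.startswith(prefix)
    }
```

A cloned job carrying both variables would then publish the regular image when it passes and the "failed" image otherwise, so a broken disk never overwrites a known-good one.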

Related issues 2 (0 open — 2 closed)

Updated by okurz almost 8 years ago

Priority changed from Normal to Low

Requested for a long time, but not easy to do. We now have #14068, which in its current state allows saving disk images and memory dumps on request from test code.

Updated by okurz almost 8 years ago

Category changed from 132 to Feature requests
Target version set to future

Updated by okurz almost 8 years ago

Related to action #14068: [tools] Gather more system information and logs in case of boot/reboot times out added

Updated by okurz over 6 years ago

Target version changed from future to future

Updated by okurz about 5 years ago

Subject changed from preserve disk image / virtual machine in case of failures on demand to preserve disk image / virtual machine / keep them running in case of failures on demand
Description updated (diff)
Status changed from New to Workable
Priority changed from Low to Normal
Target version deleted (~~future~~)

Incorporated details from near-duplicate #42677

Updated by okurz about 5 years ago

Has duplicate action #42677: Keep virtual machines running on demand/failure (was: Don't just power off virtual machines upon job timeout) added

Updated by okurz over 4 years ago

Target version set to Ready

Updated by okurz about 4 years ago

Target version changed from Ready to future
