action #105079

coordination #105073: [Epic] Improve logging in openQA

[Research][timebox: 24] Evaluate a way to snapshot and upload the entire system at the point of failure, that could be bootable as VM.

Added by JRivrain 4 months ago. Updated 3 months ago.

Status:
Rejected
Priority:
Normal
Assignee:
-
Target version:
Start date:
2022-01-19
Due date:
% Done:

0%

Estimated time:

Description

Even when we collect as many logs as we can, something is often still missing.
In the world of support, some companies take a radical approach to this. For example, Solaris used to have a tool that would snapshot the entire OS into a tarball that could be uploaded to support, so the support team could browse literally the entire system. This takes some storage space, but can save a huge amount of time:

  • going back and forth between customer and support to request specific logs,
  • rebuilding a system with the same characteristics, etc.

In some cases, we already export a qcow image at the end of installation, which can be used to reproduce bugs. But this does not represent the system at the time of failure. QEMU has snapshot capabilities that could make it possible to boot the system in the state it was in right after the failure. This could complement or partially replace the current log collection mechanisms.

Currently we have MAKETESTSNAPSHOTS (save a snapshot for each test module in the qcow image) and PUBLISH_HDD_N.

AC1: Test manually how to use QEMU qcow2 images that contain multiple snapshots.
AC2: Check with the tools team whether a qcow2 image could be published on failure just by adding an openQA setting when rerunning the job.
AC3: Make an implementation proposal and create a follow-up ticket.
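
For AC1, a minimal sketch of the manual check, assuming the `qemu-img` tool is installed and the image holds internal qcow2 snapshots. The helpers only build the command lines for `qemu-img snapshot -l` (list snapshots) and `qemu-img snapshot -a` (revert the image to a snapshot); the image name and snapshot tag are hypothetical placeholders.

```python
import shlex


def list_snapshots_cmd(image):
    """Return the argv that lists the internal snapshots of a qcow2 image."""
    return ["qemu-img", "snapshot", "-l", image]


def apply_snapshot_cmd(image, tag):
    """Return the argv that reverts the image to the snapshot named `tag`.

    Note: this modifies the image in place, so work on a copy.
    """
    return ["qemu-img", "snapshot", "-a", tag, image]


if __name__ == "__main__":
    image = "failed-job.qcow2"  # hypothetical file name
    tag = "last-module"         # hypothetical snapshot tag
    for cmd in (list_snapshots_cmd(image), apply_snapshot_cmd(image, tag)):
        # Print the commands instead of running them, so the sketch
        # works on machines without qemu-img installed.
        print(shlex.join(cmd))
```

The printed commands can be pasted into a shell; after reverting, the image can be booted as a regular VM disk.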


Related issues

Related to qe-yast - coordination #105082: Evaluate more simple and consistent ways to collect logs (New, 2022-01-19)

History

#1 Updated by JRivrain 4 months ago

#2 Updated by JRivrain 4 months ago

  • Tags set to qe-yast-refinement

#3 Updated by JERiveraMoya 4 months ago

  • Target version set to Current

#4 Updated by JERiveraMoya 4 months ago

  • Tags deleted (qe-yast-refinement)
  • Tracker changed from coordination to action
  • Subject changed from Evaluate a way to snapshot and upload the entire system at the point of failure, that could be bootable as VM. to [Research][timebox: 24] Evaluate a way to snapshot and upload the entire system at the point of failure, that could be bootable as VM.
  • Description updated (diff)
  • Status changed from New to Workable

#5 Updated by okurz 4 months ago

There is a variable FORCE_PUBLISH_HDD_$i (see https://github.com/os-autoinst/os-autoinst/blob/d466a0ee2b2b12f0a3abb60013eefe756ce67fa1/bmwqemu.pm#L141) that can be set to force publishing an HDD image even if the job fails. This is meant exactly for the purpose of investigation. In theory one could set this variable for a complete openQA instance. We don't do that for our production systems because the overhead would be massive: thousands of failing jobs would needlessly upload images that nobody ever looks at. It can of course be set on demand.
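
A minimal sketch of preparing such on-demand settings, assuming the value of each FORCE_PUBLISH_HDD_$i is the file name of the image to publish (as with PUBLISH_HDD_$i) and that settings are passed as KEY=VALUE pairs when a job is rerun. The helper names and image names are hypothetical.

```python
def force_publish_settings(image_names):
    """Map image file names to FORCE_PUBLISH_HDD_<i> settings, numbered from 1.

    Assumption: the setting value is the target file name of the image.
    """
    return {
        f"FORCE_PUBLISH_HDD_{i}": name
        for i, name in enumerate(image_names, start=1)
    }


def as_key_value_args(settings):
    """Render settings as KEY=VALUE strings, e.g. for a job-cloning command."""
    return [f"{key}={value}" for key, value in settings.items()]


if __name__ == "__main__":
    settings = force_publish_settings(["debug-failure.qcow2"])  # hypothetical name
    print(" ".join(as_key_value_args(settings)))
```

Setting this per rerun rather than instance-wide keeps the upload overhead limited to the jobs that are actually being investigated.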

#6 Updated by JERiveraMoya 3 months ago

  • Status changed from Workable to Rejected

#7 Updated by okurz 3 months ago

  • Status changed from Rejected to Workable

You might be confusing things here. #90347 is about intermediate snapshots which are recorded while a test is running. The ticket here is about uploading the entire system instead. So "snapshot" is maybe ambiguous in that context. Maybe it is better to call it "complete system image" :)

#8 Updated by JERiveraMoya 3 months ago

  • Status changed from Workable to Rejected

Initially we thought it was all the same thing; I pasted that ticket just to clarify the difference between overlays and snapshots supported by qcow2.
The setting you pointed to covers the main need, so this ticket is no longer needed. We can reject it.
