action #105079
closed
coordination #105073: [Epic] Improve logging in openQA
[Research][timebox: 24] Evaluate a way to snapshot and upload the entire system at the point of failure, that could be bootable as VM.
Added by JRivrain almost 3 years ago.
Updated over 2 years ago.
Description
Even when we collect as many logs as we can, we often find that something we need is missing.
In the world of support, some companies take a radical approach to this. For example, Solaris used to have a tool that would snapshot the entire OS into a tarball that could be uploaded to support, so the support team could navigate through literally the entire OS. It takes some storage space, but can save a huge amount of time otherwise spent on:
- going back and forth between customer and support to request specific logs,
- rebuilding a system with the same characteristics, etc.
In some cases, we already export a qcow image at the end of installation, which can be used to reproduce bugs. But this does not represent the system at the time of failure. QEMU has snapshot capabilities that could make it possible to boot the system in the state it was in right after the failure.
This could complement or partially replace the current log collection mechanisms.
Currently we have MAKETESTSNAPSHOTS ("Save snapshot for each test module in qcow image") and PUBLISH_HDD_N.
AC1: Test manually how to use those qemu qcow2 that have multiple snapshots
AC2: Communicate with the tools team about the possibility of publishing a qcow2 on failure, just by adding an openQA setting when rerunning the job.
AC3: Make a proposal of implementation and create follow-up ticket.
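For the manual test in AC1, the following is a minimal sketch of the qemu-img and QEMU invocations one might try on a qcow2 that contains several internal snapshots. The image name, the snapshot name and the output file name are hypothetical placeholders, and the helper functions are only illustrative; they build the command lines without running them.

```python
import shlex


def list_snapshots_cmd(image):
    """Command that lists the internal snapshots stored in a qcow2 image."""
    return ["qemu-img", "snapshot", "-l", image]


def extract_snapshot_cmd(image, snapshot, output):
    """Command that copies one internal snapshot into a standalone qcow2."""
    return [
        "qemu-img", "convert",
        "-l", f"snapshot.name={snapshot}",  # pick the snapshot by name
        "-O", "qcow2",
        image, output,
    ]


def boot_cmd(image, memory_mb=2048):
    """Command that boots the image; -snapshot keeps the file unmodified."""
    return [
        "qemu-system-x86_64",
        "-m", str(memory_mb),
        "-drive", f"file={image},format=qcow2,if=virtio",
        "-snapshot",
    ]


if __name__ == "__main__":
    # All names below are hypothetical examples, not real openQA assets.
    image = "failed-job.qcow2"
    snapshot = "some_test_module"
    extracted = "state-at-failure.qcow2"
    for cmd in (
        list_snapshots_cmd(image),
        extract_snapshot_cmd(image, snapshot, extracted),
        boot_cmd(extracted),
    ):
        print(shlex.join(cmd))
```

Printing the commands instead of executing them makes it easy to review them first and then paste them into a shell on a machine that has the image available.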
- Tags set to qe-yast-refinement
- Target version set to Current
- Tags deleted (qe-yast-refinement)
- Tracker changed from coordination to action
- Subject changed from Evaluate a way to snapshot and upload the entire system at the point of failure, that could be bootable as VM. to [Research][timebox: 24] Evaluate a way to snapshot and upload the entire system at the point of failure, that could be bootable as VM.
- Description updated (diff)
- Status changed from New to Workable
There is a variable FORCE_PUBLISH_HDD_$i (see https://github.com/os-autoinst/os-autoinst/blob/d466a0ee2b2b12f0a3abb60013eefe756ce67fa1/bmwqemu.pm#L141) that can be set to force publishing an HDD image even if the job fails. It is meant exactly for the purpose of investigating. In theory one could set this variable for a complete openQA instance. We don't do that for our production systems because the overhead would be massive: thousands of failing jobs would needlessly upload images that are never looked at. It can of course be set on demand.
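As a rough sketch of the on-demand case, one could clone a failed job and pass the variable as an extra setting. This assumes that FORCE_PUBLISH_HDD_$i takes a target file name in the same way PUBLISH_HDD_N does, and that openqa-clone-job is used for the rerun; the host, job ID and file name are hypothetical, and the helper only builds the command line.

```python
import shlex


def clone_with_forced_publish(host, job_id, disks):
    """Build an openqa-clone-job command that forces publishing of disks.

    `disks` maps a disk index (1, 2, ...) to the file name to publish.
    Assumption: FORCE_PUBLISH_HDD_<i> accepts a file name like PUBLISH_HDD_<i>.
    """
    settings = [
        f"FORCE_PUBLISH_HDD_{index}={name}"
        for index, name in sorted(disks.items())
    ]
    return ["openqa-clone-job", "--within-instance", host, str(job_id), *settings]


if __name__ == "__main__":
    # Hypothetical instance URL, job ID and image name.
    cmd = clone_with_forced_publish(
        "https://openqa.example.org", 1234, {1: "debug-failed-job.qcow2"}
    )
    print(shlex.join(cmd))
```

Setting the variable per cloned job keeps the upload cost limited to the failures someone actually intends to investigate.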
- Status changed from Workable to Rejected
- Status changed from Rejected to Workable
You might be confusing things here. #90347 is about intermediate snapshots which are recorded while a test is running. The ticket here is about uploading the entire system instead. So "snapshot" is maybe ambiguous in that context. Maybe better call it "complete system image" :)
- Status changed from Workable to Rejected
We initially thought it was all the same thing; I pasted that ticket just to clarify the difference between the overlays and the snapshots supported by qcow2.
With the setting you pointed to we cover the main thing and this ticket is not needed anymore. We can reject it.