action #14068
closed
[tools] Gather more system information and logs in case of boot/reboot times out
100%
Description
observation
For example https://openqa.suse.de/tests/600788#step/zypper_migration/9 fails after migration. In "first_boot" we already support some error handling to press "esc" but we need it also here and in reboot_gnome and also in case linuxrc does not boot up and is stuck in progress bar, see bsc#999231.
problem
As online migration fails often in current cases we want this urgently.
Gathering logs is not easy as the system can stop in very different steps and also in many cases does not allow to log into an existing shell (e.g. stuck during boot).
suggestion
- Press "esc" in case of boot/reboot times out to gather some console information as we already do in "first_boot"
- Instruct qemu backends to do a memory dump and save it as we do for logs
- Add qemu backend support to save memory dump
- In post_fail_hook of corresponding tests (start with first_boot) trigger the memdump
- Save the memdump to be accessible
- Make sure the size of memdump is not too big (e.g. < 2MB) as we have like 1000 failing tests each day and not infinite disk space
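The size constraint in the last bullet could be enforced with a simple check before a dump is kept. A minimal sketch in Python (os-autoinst itself is written in Perl); the function name should_keep_dump and the constant MAX_DUMP_BYTES are hypothetical, and the 2 MB limit is only the example value from the bullet above:

```python
import os
import tempfile

# Hypothetical limit, taken from the "< 2MB" example in the suggestion.
MAX_DUMP_BYTES = 2 * 1024 * 1024


def should_keep_dump(path, max_bytes=MAX_DUMP_BYTES):
    """Return True if the dump file at `path` is smaller than `max_bytes`."""
    return os.path.getsize(path) < max_bytes


def demo():
    """Create one small and one oversized temporary file and check both."""
    results = {}
    with tempfile.TemporaryDirectory() as tmp:
        small = os.path.join(tmp, "small.gz")
        large = os.path.join(tmp, "large.gz")
        with open(small, "wb") as fh:
            fh.write(b"\0" * 1024)
        with open(large, "wb") as fh:
            fh.write(b"\0" * (MAX_DUMP_BYTES + 1))
        results["small"] = should_keep_dump(small)
        results["large"] = should_keep_dump(large)
    return results
```

A post_fail_hook could call such a check after the dump is written and skip the upload when it returns False.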
Checklist
- Have a mock method in backend baseclass so memory dump method can be called safely without crashing
- Have qemu backend to support memory dumps.
- Have the WebUI/Worker to upload the memory dumps.
- Have the WebUI display the memory dumps
- Have the WebUI register and handle Memory dumps, disk files so that the gru can run cleanups when needed
- Have the webUI to display the command line needed to respawn the VM.
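The first two checklist items describe a common pattern: a no-op method in the base class that only the qemu backend overrides. A minimal sketch in Python, not the actual os-autoinst code (which is Perl); the class names, the method name save_memory_dump and the recorded monitor command are assumptions for illustration:

```python
class BaseBackend:
    """Generic backend: memory dumps are unsupported, but calling is harmless."""

    def save_memory_dump(self, filename):
        # Mock implementation: do nothing and report that no dump was written.
        return None


class QemuBackend(BaseBackend):
    """QEMU backend: records the monitor command it would send."""

    def __init__(self):
        self.sent_commands = []

    def save_memory_dump(self, filename):
        # Assumption: the dump is produced by migrating the VM state into a
        # compressed file through the QEMU monitor.
        command = 'migrate "exec:gzip -c > ulogs/%s.gz"' % filename
        self.sent_commands.append(command)
        return command


def post_fail_hook(backend):
    """Safe on every backend because the base class provides the mock."""
    return backend.save_memory_dump("vm-memory-dump")
```

With this layout a test can trigger the dump unconditionally; backends without support simply return nothing.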
Updated by okurz about 8 years ago
- Subject changed from Press "esc" in case of boot/reboot times out to gather some console information to Gather more system information and logs in case of boot/reboot times out
- Description updated (diff)
Updated by okurz about 8 years ago
- Related to action #14086: [Build 2160] test zypper_patch fails to reboot on ppc64le added
Updated by szarate about 8 years ago
- Checklist item changed from to [ ] Have a mock method in backend baseclass so memory dump method can be called safely without crashing, [ ] Have qemu backend to support memory dumps., [ ] Have the WebUI/Worker to upload the memory dumps.
- Status changed from New to In Progress
- Assignee set to szarate
- Start date changed from 2016-10-05 to 2016-10-10
Feature is being developed at: https://github.com/foursixnine/os-autoinst/tree/feature/save-vm-state
Updated by szarate about 8 years ago
- Checklist item changed from to [x] Have a mock method in backend baseclass so memory dump method can be called safely without crashing
Updated by szarate about 8 years ago
- Checklist item changed from to [x] Have qemu backend to support memory dumps.
Updated by szarate about 8 years ago
- Checklist item changed from to [x] Have the WebUI/Worker to upload the memory dumps.
Updated by szarate about 8 years ago
- Checklist item changed from [x] Have a mock method in backend baseclass so memory dump method can be called safely without crashing, [x] Have qemu backend to support memory dumps., [x] Have the WebUI/Worker to upload the memory dumps. to [x] Have a mock method in backend baseclass so memory dump method can be called safely without crashing, [x] Have qemu backend to support memory dumps., [x] Have the WebUI/Worker to upload the memory dumps., [ ] Have the WebUI display the memory dumps, [ ] Add on the exact line needed to restore
Updated by szarate about 8 years ago
- Checklist item changed from to [x] Have the WebUI display the memory dumps
Updated by agraf@suse.de about 8 years ago
On 10/18/2016 01:15 PM, redmine@opensuse.org wrote:
- Make sure the size of memdump is not too big (e.g. < 2MB) as we have like 1000 failing tests each day and not infinite disk space
A "normal" migration stream should be ~400MB. With 1000 failures a day
that means <400GB of data for a day. So with 1000 failing tests per day
and a 4TB disk (which is ~$150) you can easily store 10 days worth of
failures. If you run out of disk space, FIFO delete the old dumps...
If you also want to save disk images in parallel, expect dumps to use up
maybe ~4GB. So with 1000 failures you still get 1 day worth of failures
on that disk. Just get yourself a system with 10 disks (read: <$5k) and
you're back to 10 days worth of failures.
We're talking about storage sizes here that really shouldn't be a
problem. If we save two developers one week of debugging each we've
already created a net win.
Alex
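The storage estimate above can be checked with a few lines of Python. The numbers are the ones from the comment (decimal units assumed); the function name retention_days is hypothetical:

```python
MB, GB, TB = 10**6, 10**9, 10**12


def retention_days(disk_bytes, dump_bytes, failures_per_day):
    """Days of failures that fit on the disk before old dumps must go."""
    return disk_bytes / (dump_bytes * failures_per_day)


# ~400MB migration stream per failure, 1000 failures a day, one 4TB disk
memory_only = retention_days(4 * TB, 400 * MB, 1000)

# ~4GB per failure when disk images are saved as well
with_disk_images = retention_days(4 * TB, 4 * GB, 1000)

# same as above, but with ten 4TB disks
ten_disks = retention_days(10 * 4 * TB, 4 * GB, 1000)
```

This gives 10 days for memory dumps alone, 1 day once disk images are included, and 10 days again with ten disks.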
Updated by szarate about 8 years ago
- Checklist item changed from [x] Have a mock method in backend baseclass so memory dump method can be called safely without crashing, [x] Have qemu backend to support memory dumps., [x] Have the WebUI/Worker to upload the memory dumps., [x] Have the WebUI display the memory dumps, [ ] Add on the exact line needed to restore to [x] Have a mock method in backend baseclass so memory dump method can be called safely without crashing, [x] Have qemu backend to support memory dumps., [x] Have the WebUI/Worker to upload the memory dumps., [x] Have the WebUI display the memory dumps, [ ] Have the WebUI register and handle Memory dumps, disk files so that the gru can run cleanups when needed, [ ] Have the webUI to display the command line needed to respawn the VM.
Updated by szarate about 8 years ago
- Status changed from In Progress to Feedback
The feature on the backend side can be considered ready. WebUI housekeeping is still missing (which would address Alexander's concerns).
There's currently a bug that triggers when there's nothing in the pool directory and leaves the machine migration stalled. I still need to hunt that bug down.
Updated by okurz about 8 years ago
https://bugzilla.suse.com/show_bug.cgi?id=1005883 is a bug report using the memory dump and the disk image. So, how can this memory dump be used?
What I did:
- trigger https://openqa.suse.de/tests/621517 and enable interactive mode, wait until it got stuck trying to reboot
- as there was no ulogs directory yet in the pools directory I created one with: sudo -u _openqa-worker mkdir -p /var/lib/openqa/pool/8/ulogs
- login to openqaworker2, look in the process table for the telnet port of the qemu instance
- connect with telnet to port
- call: exec:gzip -c > ulogs/t1234-vm-memory-dump.gz
- also I saved the disk image with: cp -a /var/lib/openqa/pool/8/raid/1 /var/lib/openqa/pool/8/ulogs/disk_image_hung_in_shutdown_before_reboot.qcow2
- let the test fail and therefore upload everything under the ulogs directory to the webui
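As for how such a dump can be used: QEMU can write its migration stream to a command with migrate "exec:..." in the monitor and read it back with -incoming "exec:..." on the command line. The sketch below assumes the exec:gzip call in the steps above was the argument of the monitor's migrate command. The helper names are hypothetical and paths containing spaces or shell metacharacters are not handled:

```python
def monitor_dump_command(dump_path):
    """QEMU monitor command that migrates the VM state into a gzip file."""
    return 'migrate "exec:gzip -c > %s"' % dump_path


def incoming_args(dump_path):
    """Extra QEMU command-line arguments to resume a VM from such a dump."""
    return ["-incoming", "exec:gzip -c -d %s" % dump_path]


dump = "ulogs/t1234-vm-memory-dump.gz"
save_cmd = monitor_dump_command(dump)
restore_args = incoming_args(dump)
```

The restoring qemu process has to be started with the same machine configuration as the original one and with the saved disk image attached, otherwise the incoming state will not match.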
Updated by okurz about 8 years ago
why are you setting it to feedback then? If you have questions I suggest you ask them :-) Otherwise, you can also keep the ticket in state "In Progress" and unassign if you can't continue on it.
The PR for the os-autoinst change is: https://github.com/os-autoinst/os-autoinst/pull/621
Updated by szarate about 8 years ago
I have filed bsc#1008148, which is actually the reason why the memory dumps were being stalled.
It looks like a migration after a snapshot has been created is currently impossible: the snapshot code does not clean up the migration state, so any further migration requested by the user cannot be performed. This commit solves the problem.
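One way a backend could guard against a leftover migration state is to look at the reply of QMP's query-migrate before requesting a new migration. The following is only a sketch of that idea and not the fix referenced above; the function name and the set of idle states are assumptions based on the status field of query-migrate:

```python
# States in which no migration is running (assumption based on the "status"
# field of the QMP query-migrate reply).
IDLE_STATES = {"none", "completed", "failed", "cancelled"}


def can_start_migration(query_migrate_reply):
    """Decide from a query-migrate reply whether a new migration may start."""
    status = query_migrate_reply.get("return", {}).get("status")
    # An empty reply means no migration has been initiated so far.
    return status is None or status in IDLE_STATES
```

For example, a reply of {"return": {"status": "active"}} would make the caller wait or abort instead of sending another migrate command.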
Updated by maritawerner about 8 years ago
- Related to action #13896: collect linuxrc logs on installation startup problems / turn off plymouth to debug startup problems added
Updated by maritawerner about 8 years ago
- Related to action #12246: [tools]upload of log files can fail sometimes (was: https://openqa.suse.de/tests/412464 has no X-related log files) added
Updated by maritawerner about 8 years ago
Added link: Related to action #12246: https://openqa.suse.de/tests/412464 has no X-related log files
Updated by szarate about 8 years ago
@maritawerner, @okurz I believe that #12246 is more related to #14902 and its solution than this one.
Updated by okurz about 8 years ago
No, because #14902 is about "no proper log at all" and #12246 is about logs in specific cases. Keep in mind that both this ticket here and #12246 are from a test reviewer's perspective; #14902 is an openQA or backend issue relevant for admins of the test infrastructure, e.g. why the communication between webui and worker breaks.
Updated by szarate about 8 years ago
- Status changed from Feedback to Resolved
Well BSC#1008148 has been marked as resolved, we now can safely roll with this.
I have a WIP for the last two items, will work on this later on, but currently the documentation is enough to get started to use the feature.
Updated by okurz about 8 years ago
@szarate can you help with actually using this feature and also help with "simpler" tasks to get our test failure investigation in better shape? E.g. see #15170 and other subtickets of this one.
oh, and a hint regarding bugs in RESOLVED FIXED. We as QA should tend to set it to VERIFIED FIXED if we can actually verify the bugfix in our products, i.e. after a build includes this.
Updated by okurz almost 8 years ago
szarate, updates on this? As I can see from the checklist there are still two tasks although I think "Have the WebUI register and handle Memory dumps, disk files so that the gru can run cleanups when needed" should be done.
Updated by szarate almost 8 years ago
Now that qemu is fixed, I think it'll be a good time to add this... I might work on this during this week
Updated by okurz almost 8 years ago
- Related to action #12836: preserve disk image / virtual machine / keep them running in case of failures on demand added
Updated by okurz almost 8 years ago
- Checklist item changed from to [x] Have the WebUI register and handle Memory dumps, disk files so that the gru can run cleanups when needed
Updated by RBrownSUSE almost 8 years ago
- Subject changed from Gather more system information and logs in case of boot/reboot times out to [tools]Gather more system information and logs in case of boot/reboot times out
Updated by dzedro almost 8 years ago
Memory dump died, DIE Migration failed: desc: There's a migration process in progress, class: GenericError, stopped at /usr/lib/os-autoinst/backend/qemu.pm line 169.
https://openqa.suse.de/tests/815304
Updated by okurz over 7 years ago
- Related to action #16520: [qam][opensuse][sle][functional] enhance logging and debugging in case of failed shutdown, e.g. press 'esc' on plymouth splash screen added
Updated by okurz over 6 years ago
- Related to action #34609: [sle][functional][u][medium] Improve Implementation of workaround for bsc#1083646 and debug output in reconnect_s390 on S390-KVM added
Updated by okurz over 6 years ago
With https://github.com/os-autoinst/os-autoinst-distri-opensuse/pull/4843 I added a call to help investigation of bootup problems on s390x z/VM, sending magic-sysrq-w to show "blocked tasks". This can help in case we cannot even log in. This might be limited to some platforms where the rescue mode is not available/working. Still, for automatic investigation this can be helpful.
Updated by szarate over 6 years ago
@okurz: I wonder if poo#36601 is needed at all
Updated by okurz over 6 years ago
well, I think the whole memory dump feature is useless as long as we do not use it on a regular basis, e.g. see the post_fail_hook of tests/installation/first_boot.pm in os-autoinst-distri-opensuse referencing https://progress.opensuse.org/issues/19390 and such. I think this could be revisited. Also, a command to respawn the VM shown to a bug assignee one way or another would be helpful. Of course, this should not just output the qemu command line from the autoinst-log.txt but also load a memory dump.
Updated by okurz almost 5 years ago
- Subject changed from [tools]Gather more system information and logs in case of boot/reboot times out to [tools] Gather more system information and logs in case of boot/reboot times out
- Status changed from Feedback to Resolved
- Assignee set to okurz
I guess it's ok if we keep the subticket #36601 (in the parent project) open and close this "test related" issue, especially since we now have even better logs and information, e.g. magic sysrq on unresponsive systems, etc.