action #14068: [tools] Gather more system information and logs in case of boot/reboot times out - openQA Tests (public) - openSUSE Project Management Tool

#1

Updated by okurz over 8 years ago

Subject changed from Press "esc" in case of boot/reboot times out to gather some console information to Gather more system information and logs in case of boot/reboot times out
Description updated (diff)

#2

Updated by okurz over 8 years ago

Related to action #14086: [Build 2160] test zypper_patch fails to reboot on ppc64le added

#3

Updated by szarate over 8 years ago

Checklist item changed from to [ ] Have a mock method in backend baseclass so memory dump method can be called safely without crashing, [ ] Have qemu backend to support memory dumps., [ ] Have the WebUI/Worker to upload the memory dumps.
Status changed from New to In Progress
Assignee set to szarate
Start date changed from 2016-10-05 to 2016-10-10

Feature is being developed at: https://github.com/foursixnine/os-autoinst/tree/feature/save-vm-state
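
The first checklist item (a mock method in the backend base class so the memory dump method can be called safely) can be pictured with a minimal sketch. This is illustrative Python, not the os-autoinst code, and every name in it (`BaseBackend`, `QemuBackend`, `save_memory_dump`, `dump_vm_memory`) is hypothetical:

```python
class BaseBackend:
    """Hypothetical base class: every backend inherits a harmless default."""

    def save_memory_dump(self, filename: str) -> bool:
        # Default implementation does nothing, so tests can call this on
        # any backend without crashing. False means "no dump was written".
        return False


class QemuBackend(BaseBackend):
    """Hypothetical QEMU backend that overrides the default with real work."""

    def __init__(self, dump_vm_memory):
        # dump_vm_memory is an assumed callable that performs the actual dump.
        self._dump_vm_memory = dump_vm_memory

    def save_memory_dump(self, filename: str) -> bool:
        self._dump_vm_memory(filename)
        return True


if __name__ == "__main__":
    written = []
    for backend in (BaseBackend(), QemuBackend(written.append)):
        print(type(backend).__name__, backend.save_memory_dump("vm-memory-dump"))
    print(written)
```

The point of the design is that a post_fail_hook can request a dump unconditionally, and only backends that know how to produce one do any work.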

#4

Updated by szarate over 8 years ago

Checklist item changed from to [x] Have a mock method in backend baseclass so memory dump method can be called safely without crashing

#5

Updated by szarate over 8 years ago

Checklist item changed from to [x] Have qemu backend to support memory dumps.

#6

Updated by szarate over 8 years ago

Checklist item changed from to [x] Have the WebUI/Worker to upload the memory dumps.

#7

Updated by szarate over 8 years ago

Checklist item changed from [x] Have a mock method in backend baseclass so memory dump method can be called safely without crashing, [x] Have qemu backend to support memory dumps., [x] Have the WebUI/Worker to upload the memory dumps. to [x] Have a mock method in backend baseclass so memory dump method can be called safely without crashing, [x] Have qemu backend to support memory dumps., [x] Have the WebUI/Worker to upload the memory dumps., [ ] Have the WebUI display the memory dumps, [ ] Add on the exact line needed to restore

#8

Updated by szarate over 8 years ago

Checklist item changed from to [x] Have the WebUI display the memory dumps

#9

Updated by agraf@suse.de over 8 years ago

On 10/18/2016 01:15 PM, redmine@opensuse.org wrote:

[openSUSE Tracker]
Issue #14068 has been updated by szarate.

Checklist set to [x] Have the WebUI display the memory dumps

action #14068: Gather more system information and logs in case of boot/reboot times out
https://progress.opensuse.org/issues/14068#change-30078

Author: okurz

Status: In Progress

Priority: Urgent

Assignee: szarate

Category: Enhancement to existing tests

Target version:

observation

For example https://openqa.suse.de/tests/600788#step/zypper_migration/9 fails after migration. In "first_boot" we already support some error handling to press "esc" but we need it also here and in reboot_gnome and also in case linuxrc does not boot up and is stuck in progress bar, see bsc#999231.

problem

As online migration fails often in current cases we want this urgently.
Gathering logs is not easy as the system can stop in very different steps and also in many cases does not allow to log into an existing shell (e.g. stuck during boot).

suggestion

Press "esc" in case of boot/reboot times out to gather some console information as we already do in "first_boot"

Instruct qemu backends to do a memory dump and save it as we do for logs

Add qemu backend support to save memory dump

In post_fail_hook of corresponding tests (start with first_boot) trigger the memdump

Save the memdump to be accessible

Make sure the size of memdump is not too big (e.g. < 2MB) as we have like 1000 failing tests each day and not infinite disk space

A "normal" migration stream should be ~400MB. With 1000 failures a day
that means <400GB of data for a day. So with 1000 failing tests per day
and a 4TB disk (which is ~$150) you can easily store 10 days worth of
failures. If you run out of disk space, FIFO delete the old dumps...

If you also want to save disk images in parallel, expect dumps to use up
maybe ~4GB. So with 1000 failures you still get 1 day worth of failures
on that disk. Just get yourself a system with 10 disks (read: <$5k) and
you're back to 10 days worth of failures.

We're talking about storage sizes here that really shouldn't be a
problem. If we save two developers one week of debugging each we've
already created a net win.

Alex
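
The storage estimate in the comment above can be checked with a few lines of arithmetic. This is a minimal sketch that uses only the figures quoted there (about 400MB per migration stream, about 4GB per failure when disk images are kept too, 1000 failures a day, 4TB per disk) and assumes decimal units; the function names are made up for illustration:

```python
MB_PER_GB = 1000  # decimal units, as assumed for this estimate
GB_PER_TB = 1000


def daily_volume_gb(dump_mb: float, failures_per_day: int) -> float:
    """Data produced per day, in GB, if every failure stores one dump."""
    return dump_mb * failures_per_day / MB_PER_GB


def retention_days(disk_tb: float, dump_mb: float, failures_per_day: int) -> float:
    """Days of failures that fit on the disk before old dumps must go."""
    return disk_tb * GB_PER_TB / daily_volume_gb(dump_mb, failures_per_day)


if __name__ == "__main__":
    print(daily_volume_gb(400, 1000))       # memory dumps only, GB per day
    print(retention_days(4, 400, 1000))     # one 4TB disk, memory dumps only
    print(retention_days(4, 4000, 1000))    # one 4TB disk, dumps plus disk images
    print(retention_days(40, 4000, 1000))   # ten 4TB disks, dumps plus disk images
```

With these inputs the results match the comment: 400GB per day, 10 days on one disk for memory dumps alone, 1 day once disk images are included, and 10 days again with ten disks.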

#10

Updated by szarate over 8 years ago

Checklist item changed from [x] Have a mock method in backend baseclass so memory dump method can be called safely without crashing, [x] Have qemu backend to support memory dumps., [x] Have the WebUI/Worker to upload the memory dumps., [x] Have the WebUI display the memory dumps, [ ] Add on the exact line needed to restore to [x] Have a mock method in backend baseclass so memory dump method can be called safely without crashing, [x] Have qemu backend to support memory dumps., [x] Have the WebUI/Worker to upload the memory dumps., [x] Have the WebUI display the memory dumps, [ ] Have the WebUI register and handle Memory dumps, disk files so that the gru can run cleanups when needed, [ ] Have the webUI to display the command line needed to respawn the VM.

#11

Updated by szarate over 8 years ago

Status changed from In Progress to Feedback

The feature on the backend side can be considered ready. WebUI housekeeping is still missing (which would address Alexander's concerns).

There's currently a bug that triggers when there's nothing in the pool directory and leaves the machine migration stalled. I still need to hunt that bug down.
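
The missing housekeeping, i.e. the "FIFO delete the old dumps" idea from Alexander's comment, could look roughly like the sketch below. It is not how the openQA WebUI or gru implements cleanup; the function name `prune_oldest` and the size budget are assumptions for illustration:

```python
from pathlib import Path


def prune_oldest(directory: Path, max_bytes: int) -> list[Path]:
    """Delete the oldest files until the directory fits in max_bytes.

    Returns the deleted paths, oldest first. Only regular files directly
    inside `directory` are considered.
    """
    files = sorted(
        (p for p in directory.iterdir() if p.is_file()),
        key=lambda p: p.stat().st_mtime,  # oldest modification time first
    )
    total = sum(p.stat().st_size for p in files)
    deleted = []
    for path in files:
        if total <= max_bytes:
            break
        total -= path.stat().st_size
        path.unlink()
        deleted.append(path)
    return deleted
```

Calling `prune_oldest(Path("/some/dump/dir"), 4 * 10**12)` would keep at most about 4TB of dumps; the directory here is a placeholder.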

#12

Updated by okurz over 8 years ago

https://bugzilla.suse.com/show_bug.cgi?id=1005883 is a bug report using the memory dump and the disk image. So, how can this memory dump be used?

What I did:

trigger https://openqa.suse.de/tests/621517 and enable interactive mode, wait until it got stuck trying to reboot
as there was no ulogs directory yet in the pools directory I created one with sudo -u _openqa-worker mkdir -p /var/lib/openqa/pool/8/ulogs
login to openqaworker2, look in the process table for the telnet port of the qemu instance
connect with telnet to port
call exec:gzip -c > ulogs/t1234-vm-memory-dump.gz
also I saved the disk image with cp -a /var/lib/openqa/pool/8/raid/1 /var/lib/openqa/pool/8/ulogs/disk_image_hung_in_shutdown_before_reboot.qcow2
let the test fail and therefore upload everything under the ulogs directory to the webui
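
The telnet steps above could be scripted roughly as follows. This is a sketch under assumptions: it assumes the telnet port is QEMU's human monitor and that the dump is started with the monitor's `migrate "exec:..."` command (the comment only quotes the `exec:gzip -c > ...` part). The host, port and helper names are placeholders, and the code does not wait for the migration to finish:

```python
import socket


def migrate_to_gzip_command(dump_path: str) -> str:
    """Build a QEMU monitor command that streams the VM state through gzip."""
    return f'migrate "exec:gzip -c > {dump_path}"'


def send_monitor_command(host: str, port: int, command: str,
                         timeout: float = 10.0) -> None:
    """Send one command line to a monitor listening on a TCP/telnet port."""
    with socket.create_connection((host, port), timeout=timeout) as conn:
        conn.sendall(command.encode() + b"\n")


if __name__ == "__main__":
    # Printing only; sending requires a running QEMU with a reachable monitor.
    print(migrate_to_gzip_command("ulogs/t1234-vm-memory-dump.gz"))
```

The relative path is resolved by the QEMU process, so the `ulogs` directory has to exist in its working directory, which is why it was created by hand in the steps above.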

#13

Updated by okurz over 8 years ago

Why are you setting it to Feedback then? If you have questions I suggest you ask them :-) Otherwise, you can also keep the ticket in state "In Progress" and unassign yourself if you can't continue on it.

The PR for the os-autoinst change is: https://github.com/os-autoinst/os-autoinst/pull/621

#14

Updated by szarate over 8 years ago

I have filed bsc#1008148, which is actually the reason why the memory dumps were being stalled.

It looks like a migration is currently impossible after a snapshot has been created: the snapshot code does not clean up the migration state, so any later migration started by the user cannot be performed. This commit solves the problem.

#15

Updated by maritawerner over 8 years ago

Related to action #13896: collect linuxrc logs on installation startup problems / turn off plymouth to debug startup problems added

#16

Updated by maritawerner over 8 years ago

Related to action #12246: [tools]upload of log files can fail sometimes (was: https://openqa.suse.de/tests/412464 has no X-related log files) added

#17

Updated by maritawerner over 8 years ago

Added link: Related to action #12246: https://openqa.suse.de/tests/412464 has no X-related log files

#18

Updated by szarate over 8 years ago

@maritawerner, @okurz I believe that #12246 is more related to #14902 and its solution than to this one.

#19

Updated by okurz over 8 years ago

No, because #14902 is about "no proper log at all" and #12246 is about logs in specific cases. Keep in mind that both this ticket and #12246 are written from the test reviewers' perspective, while #14902 is an openQA or backend issue relevant for admins of the test infrastructure, e.g. why the communication between webui and worker breaks.

#20

Updated by szarate over 8 years ago

Status changed from Feedback to Resolved

Well, bsc#1008148 has been marked as resolved, so we can now safely roll with this.

I have a WIP for the last two items and will work on it later on, but the current documentation is enough to start using the feature.

#21

Updated by szarate over 8 years ago

Status changed from Resolved to Feedback

Whoops :)

#22

Updated by okurz over 8 years ago

@szarate can you help with actually using this feature, and also with "simpler" tasks to get our test failure investigation into better shape? E.g. see #15170 and other subtickets of this one.

Oh, and a hint regarding bugs in RESOLVED FIXED: we as QA should tend to set them to VERIFIED FIXED if we can actually verify the bugfix in our products, i.e. after a build includes it.

#23

Updated by okurz over 8 years ago

szarate, updates on this? As I can see from the checklist there are still two tasks although I think "Have the WebUI register and handle Memory dumps, disk files so that the gru can run cleanups when needed" should be done.

#24

Updated by szarate over 8 years ago

As we now have qemu fixed, I think it's a good time to add this... I might work on it this week.

#25

Updated by okurz over 8 years ago

Related to action #12836: preserve disk image / virtual machine / keep them running in case of failures on demand added

#26

Updated by okurz over 8 years ago

Checklist item changed from to [x] Have the WebUI register and handle Memory dumps, disk files so that the gru can run cleanups when needed

#27

Updated by RBrownSUSE about 8 years ago

Subject changed from Gather more system information and logs in case of boot/reboot times out to [tools]Gather more system information and logs in case of boot/reboot times out

#28

Updated by dzedro about 8 years ago

Memory dump died, DIE Migration failed: desc: There's a migration process in progress, class: GenericError, stopped at /usr/lib/os-autoinst/backend/qemu.pm line 169.
https://openqa.suse.de/tests/815304

#29

Updated by okurz about 8 years ago

Related to action #16520: [qam][opensuse][sle][functional] enhance logging and debugging in case of failed shutdown, e.g. press 'esc' on plymouth splash screen added

#30

Updated by okurz about 7 years ago

Related to action #34609: [sle][functional][u][medium] Improve Implementation of workaround for bsc#1083646 and debug output in reconnect_s390 on S390-KVM added

#31

Updated by okurz about 7 years ago

With https://github.com/os-autoinst/os-autoinst-distri-opensuse/pull/4843 I added a call to help the investigation of bootup problems on s390x z/VM by sending magic-sysrq-w to show "blocked tasks". This can help in case we cannot even log in. It might be limited to some platforms where the rescue mode is not available or not working. Still, for automatic investigation this can be helpful.

#32

Updated by szarate about 7 years ago

Assignee deleted (szarate)

#33

Updated by szarate about 7 years ago

@okurz: I wonder if poo#36601 is needed at all

#34

Updated by okurz about 7 years ago

Well, I think the whole memory dump feature is useless as long as we do not use it on a regular basis, e.g. see the post_fail_hook of tests/installation/first_boot.pm in os-autoinst-distri-opensuse referencing https://progress.opensuse.org/issues/19390 and such. I think this could be revisited. Also, a command to respawn the VM, shown to a bug assignee one way or another, would be helpful. Of course, this should not just output the qemu command line from the autoinst-log.txt but also load a memory dump.

#35

Updated by okurz over 5 years ago

Subject changed from [tools]Gather more system information and logs in case of boot/reboot times out to [tools] Gather more system information and logs in case of boot/reboot times out
Status changed from Feedback to Resolved
Assignee set to okurz

I guess it's ok if we keep the subticket #36601 (in the parent project) open and close this "test related" issue, especially as we have had even better logs and information since then, e.g. magic sysrq on unresponsive systems.
