action #14068

closed

[tools] Gather more system information and logs in case of boot/reboot times out

Added by okurz about 8 years ago. Updated over 4 years ago.

Status: Resolved
Priority: Normal
Assignee:
Category: Enhancement to existing tests
Target version: -
Start date: 2016-09-23
Due date:
% Done: 100%
Estimated time: (Total: 0.00 h)
Difficulty:
Description

observation

For example, https://openqa.suse.de/tests/600788#step/zypper_migration/9 fails after migration. In "first_boot" we already have some error handling that presses "esc", but we also need it here, in reboot_gnome, and for the case where linuxrc does not boot up and is stuck at the progress bar; see bsc#999231.

problem

As online migration currently fails often, we want this urgently.
Gathering logs is not easy because the system can stop at very different steps and in many cases does not allow logging into an existing shell (e.g. when stuck during boot).

suggestion

  • Press "esc" when boot/reboot times out to gather some console information, as we already do in "first_boot"
  • Instruct qemu backends to do a memory dump and save it as we do for logs
    • Add qemu backend support to save a memory dump
    • In the post_fail_hook of the corresponding tests (start with first_boot), trigger the memdump
    • Save the memdump so that it is accessible
    • Make sure the memdump is not too big (e.g. < 2MB), as we have around 1000 failing tests each day and not infinite disk space

Checklist

  • Have a mock method in backend baseclass so memory dump method can be called safely without crashing
  • Have the qemu backend support memory dumps.
  • Have the WebUI/Worker upload the memory dumps.
  • Have the WebUI display the memory dumps
  • Have the WebUI register and handle memory dumps and disk files so that the gru can run cleanups when needed
  • Have the WebUI display the command line needed to respawn the VM.
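
The first two checklist items can be illustrated with a minimal sketch. It is written in Python purely for illustration and is not the os-autoinst code; the class names, the method name save_memory_dump, the upload directory and the read_guest_memory callable are all hypothetical. The point is only that the base class provides a no-op, so a post_fail_hook can call the method on any backend without crashing.

```python
import gzip
from pathlib import Path


class BaseBackend:
    """Hypothetical backend base class (not the os-autoinst API)."""

    def save_memory_dump(self, filename: str):
        # Default no-op: backends without dump support return None
        # instead of raising, so callers never crash.
        return None


class QemuLikeBackend(BaseBackend):
    """Hypothetical backend that can actually produce a dump."""

    def __init__(self, upload_dir: str, read_guest_memory):
        self.upload_dir = Path(upload_dir)
        # Assumption: a callable returning the guest memory as bytes.
        self._read_guest_memory = read_guest_memory

    def save_memory_dump(self, filename: str):
        self.upload_dir.mkdir(parents=True, exist_ok=True)
        target = self.upload_dir / f"{filename}.gz"
        # Compress the dump so it takes less space when uploaded.
        with gzip.open(target, "wb") as fh:
            fh.write(self._read_guest_memory())
        return target


def post_fail_hook(backend: BaseBackend, test_name: str):
    # Safe for every backend: unsupported ones simply return None.
    return backend.save_memory_dump(f"{test_name}-vm-memory-dump")
```

Calling post_fail_hook(BaseBackend(), "first_boot") returns None, while the same call with a QemuLikeBackend writes a gzip file into its upload directory.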

Subtasks 6 (0 open, 6 closed)

action #15170: boot_to_desktop should use same error analysis approach as first_boot | Resolved | okurz | 2016-11-30
action #13874: [Build 2141] test reboot_gnome fails in reboot, should press "esc" to show details | Resolved | mkravec | 2016-09-23
action #16286: online_migration_setup should use same emergency handling as first_boot | Resolved | michalnowak | 2017-01-27
action #17638: [migration] online_migration_setup should use some error investigation like e.g. first_boot, reboot_gnome, etc. | Resolved | qmsu | 2017-03-09
action #17196: [tw][gnome-live] test fails in reboot_gnome to show logout dialog | Resolved | dimstar | 2017-02-19
openQA Project (public) - action #36601: Display the command to spawn a VM for virtualization backends | Rejected | okurz | 2018-05-28

Related issues 6 (1 open, 5 closed)

Related to openQA Tests (public) - action #14086: [Build 2160] test zypper_patch fails to reboot on ppc64le | Resolved | mitiao | 2016-10-06
Related to openQA Tests (public) - action #13896: collect linuxrc logs on installation startup problems / turn off plymouth to debug startup problems | Resolved | okurz | 2016-09-26
Related to openQA Project (public) - action #12246: [tools]upload of log files can fail sometimes (was: https://openqa.suse.de/tests/412464 has no X-related log files) | Resolved | szarate | 2016-06-07
Related to openQA Project (public) - action #12836: preserve disk image / virtual machine / keep them running in case of failures on demand | Workable | | 2016-07-24
Related to openQA Tests (public) - action #16520: [qam][opensuse][sle][functional] enhance logging and debugging in case of failed shutdown, e.g. press 'esc' on plymouth splash screen | Resolved | nicksinger | 2017-02-06 | 2017-11-08
Related to openQA Tests (public) - action #34609: [sle][functional][u][medium] Improve Implementation of workaround for bsc#1083646 and debug output in reconnect_s390 on S390-KVM | Rejected | mgriessmeier | 2018-04-10 | 2018-04-24

Actions #1

Updated by okurz about 8 years ago

  • Subject changed from Press "esc" in case of boot/reboot times out to gather some console information to Gather more system information and logs in case of boot/reboot times out
  • Description updated (diff)
Actions #2

Updated by okurz about 8 years ago

  • Related to action #14086: [Build 2160] test zypper_patch fails to reboot on ppc64le added
Actions #3

Updated by szarate about 8 years ago

  • Checklist item changed from to [ ] Have a mock method in backend baseclass so memory dump method can be called safely without crashing, [ ] Have qemu backend to support memory dumps., [ ] Have the WebUI/Worker to upload the memory dumps.
  • Status changed from New to In Progress
  • Assignee set to szarate
  • Start date changed from 2016-10-05 to 2016-10-10
Actions #4

Updated by szarate about 8 years ago

  • Checklist item changed from to [x] Have a mock method in backend baseclass so memory dump method can be called safely without crashing
Actions #5

Updated by szarate about 8 years ago

  • Checklist item changed from to [x] Have qemu backend to support memory dumps.
Actions #6

Updated by szarate about 8 years ago

  • Checklist item changed from to [x] Have the WebUI/Worker to upload the memory dumps.
Actions #7

Updated by szarate about 8 years ago

  • Checklist item changed from [x] Have a mock method in backend baseclass so memory dump method can be called safely without crashing, [x] Have qemu backend to support memory dumps., [x] Have the WebUI/Worker to upload the memory dumps. to [x] Have a mock method in backend baseclass so memory dump method can be called safely without crashing, [x] Have qemu backend to support memory dumps., [x] Have the WebUI/Worker to upload the memory dumps., [ ] Have the WebUI display the memory dumps, [ ] Add on the exact line needed to restore
Actions #8

Updated by szarate about 8 years ago

  • Checklist item changed from to [x] Have the WebUI display the memory dumps
Actions #9

Updated by agraf@suse.de about 8 years ago

On 10/18/2016 01:15 PM, redmine@opensuse.org wrote:

[openSUSE Tracker]
Issue #14068 has been updated by szarate.

Checklist set to [x] Have the WebUI display the memory dumps


action #14068: Gather more system information and logs in case of boot/reboot times out
https://progress.opensuse.org/issues/14068#change-30078

  • Author: okurz
  • Status: In Progress
  • Priority: Urgent
  • Assignee: szarate
  • Category: Enhancement to existing tests

  • Target version:


A "normal" migration stream should be ~400MB. With 1000 failures a day
that means <400GB of data for a day. So with 1000 failing tests per day
and a 4TB disk (which is ~$150) you can easily store 10 days worth of
failures. If you run out of disk space, FIFO delete the old dumps...

If you also want to save disk images in parallel, expect dumps to use up
maybe ~4GB. So with 1000 failures you still get 1 day worth of failures
on that disk. Just get yourself a system with 10 disks (read: <$5k) and
you're back to 10 days worth of failures.

We're talking about storage sizes here that really shouldn't be a
problem. If we save two developers one week of debugging each we've
already created a net win.

Alex
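
The storage estimate above can be checked with a few lines of arithmetic. This is a sketch using only the figures quoted in the comment (about 400MB per dump, about 4GB when disk images are kept too, 1000 failures per day, 4TB per disk) and decimal units; the function name retention_days is made up for the example.

```python
MB, GB, TB = 10**6, 10**9, 10**12


def retention_days(disk_bytes, failures_per_day, bytes_per_failure):
    """Days of failures that fit on the given disk space."""
    return disk_bytes / (failures_per_day * bytes_per_failure)


# Memory dumps only: ~400MB each on a single 4TB disk.
dumps_only = retention_days(4 * TB, 1000, 400 * MB)

# Memory dump plus disk image: ~4GB each on a single 4TB disk.
with_disk_images = retention_days(4 * TB, 1000, 4 * GB)

# Same ~4GB artifacts, but on ten 4TB disks.
ten_disks = retention_days(10 * 4 * TB, 1000, 4 * GB)

print(dumps_only, with_disk_images, ten_disks)  # 10.0 1.0 10.0
```

The result matches the comment: 10 days with dumps only, 1 day once disk images are included, and 10 days again with ten disks.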

Actions #10

Updated by szarate about 8 years ago

  • Checklist item changed from [x] Have a mock method in backend baseclass so memory dump method can be called safely without crashing, [x] Have qemu backend to support memory dumps., [x] Have the WebUI/Worker to upload the memory dumps., [x] Have the WebUI display the memory dumps, [ ] Add on the exact line needed to restore to [x] Have a mock method in backend baseclass so memory dump method can be called safely without crashing, [x] Have qemu backend to support memory dumps., [x] Have the WebUI/Worker to upload the memory dumps., [x] Have the WebUI display the memory dumps, [ ] Have the WebUI register and handle Memory dumps, disk files so that the gru can run cleanups when needed, [ ] Have the webUI to display the command line needed to respawn the VM.
Actions #11

Updated by szarate about 8 years ago

  • Status changed from In Progress to Feedback

The feature on the backend side can be considered ready. WebUI housekeeping is still missing (which would address Alexander's concerns).

There's currently a bug that triggers when there's nothing in the pool directory and leaves the machine migration stalled. I still need to hunt that bug down.
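
The missing housekeeping could follow the "FIFO delete the old dumps" idea from Alexander's comment. The following is a minimal sketch of that idea only, not how openQA or gru implement cleanup; the function name prune_oldest, the flat directory layout and the use of modification time as the age criterion are assumptions.

```python
from pathlib import Path


def prune_oldest(directory, max_total_bytes):
    """Delete the oldest files until the directory fits the budget.

    Returns the names of the removed files, oldest first.
    """
    # Oldest first, judged by modification time (an assumption).
    files = sorted(
        (p for p in Path(directory).iterdir() if p.is_file()),
        key=lambda p: p.stat().st_mtime,
    )
    total = sum(p.stat().st_size for p in files)
    removed = []
    for path in files:
        if total <= max_total_bytes:
            break
        total -= path.stat().st_size
        path.unlink()
        removed.append(path.name)
    return removed
```

A periodic job could call prune_oldest on the dump directory with a budget chosen from the available disk space.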

Actions #12

Updated by okurz about 8 years ago

https://bugzilla.suse.com/show_bug.cgi?id=1005883 is a bug report using the memory dump and the disk image. So, how can this memory dump be used?

What I did:

  • trigger https://openqa.suse.de/tests/621517 and enable interactive mode, wait until it got stuck trying to reboot
  • as there was no ulogs directory yet in the pools directory I created one with sudo -u _openqa-worker mkdir -p /var/lib/openqa/pool/8/ulogs
  • login to openqaworker2, look in the process table for the telnet port of the qemu instance
  • connect with telnet to port
  • in the qemu monitor, call migrate "exec:gzip -c > ulogs/t1234-vm-memory-dump.gz"
  • also I saved the disk image with cp -a /var/lib/openqa/pool/8/raid/1 /var/lib/openqa/pool/8/ulogs/disk_image_hung_in_shutdown_before_reboot.qcow2
  • let the test fail and therefore upload everything under the ulogs directory to the webui
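
The manual steps above can be sketched as a small script. It assumes that the command issued at the monitor prompt is QEMU's human monitor command migrate with an exec: target, and that the monitor accepts plain text over a TCP socket; host and port are placeholders you would read from the qemu process table, as described above. A raw socket is used here instead of a telnet client, which is an assumption and ignores any telnet option negotiation.

```python
import socket


def build_migrate_command(dump_path: str) -> str:
    """Build the monitor command that pipes VM state through gzip."""
    if '"' in dump_path or "\n" in dump_path:
        # Keep the quoting of the monitor command intact.
        raise ValueError("dump path must not contain quotes or newlines")
    return f'migrate "exec:gzip -c > {dump_path}"'


def send_monitor_command(host: str, port: int, command: str,
                         timeout: float = 10.0) -> None:
    """Send one command line to a monitor listening on host:port."""
    with socket.create_connection((host, port), timeout=timeout) as conn:
        conn.sendall(command.encode("ascii") + b"\n")


if __name__ == "__main__":
    # Relative path as in the steps above; it is resolved relative to
    # the working directory of the qemu process.
    print(build_migrate_command("ulogs/t1234-vm-memory-dump.gz"))
```

The script only prints the command when run directly; sending it requires passing the real monitor host and port to send_monitor_command.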
Actions #13

Updated by okurz about 8 years ago

Why are you setting it to Feedback then? If you have questions I suggest you ask them :-) Otherwise, you can also keep the ticket in state "In Progress" and unassign if you can't continue on it.

The PR for the os-autoinst change is: https://github.com/os-autoinst/os-autoinst/pull/621

Actions #14

Updated by szarate about 8 years ago

I have filed bsc#1008148, which is actually the reason why the memory dumps were stalling.

It looks like a migration after a snapshot has been created is currently impossible: the snapshot code does not clean up the migration state, so any later migration requested by the user cannot be performed. This commit solves the problem.

Actions #15

Updated by maritawerner about 8 years ago

  • Related to action #13896: collect linuxrc logs on installation startup problems / turn off plymouth to debug startup problems added
Actions #16

Updated by maritawerner about 8 years ago

  • Related to action #12246: [tools]upload of log files can fail sometimes (was: https://openqa.suse.de/tests/412464 has no X-related log files) added
Actions #17

Updated by maritawerner about 8 years ago

Added link: Related to action #12246: https://openqa.suse.de/tests/412464 has no X-related log files

Actions #18

Updated by szarate about 8 years ago

@maritawerner, @okurz I believe that #12246 is more related to #14902 and its solution than to this one.

Actions #19

Updated by okurz about 8 years ago

No, because #14902 is about "no proper log at all" and #12246 is about logs in specific cases. Keep in mind that both this ticket and #12246 are written from the test reviewers' perspective, while #14902 is an openQA or backend issue relevant for admins of the test infrastructure, e.g. why the communication between webui and worker breaks.

Actions #20

Updated by szarate about 8 years ago

  • Status changed from Feedback to Resolved

Well, bsc#1008148 has been marked as resolved, so we can now safely roll with this.

I have a WIP for the last two items and will work on it later on, but the current documentation is enough to get started with the feature.

Actions #21

Updated by szarate about 8 years ago

  • Status changed from Resolved to Feedback

Whoops :)

Actions #22

Updated by okurz about 8 years ago

@szarate can you help with actually using this feature and also help with "simpler" tasks to get our test failure investigation into better shape? E.g. see #15170 and the other subtickets of this one.

Oh, and a hint regarding bugs in RESOLVED FIXED: we as QA should set them to VERIFIED FIXED if we can actually verify the bugfix in our products, i.e. after a build includes it.

Actions #23

Updated by okurz almost 8 years ago

szarate, any updates on this? As I can see from the checklist there are still two open tasks, although I think "Have the WebUI register and handle Memory dumps, disk files so that the gru can run cleanups when needed" should be done.

Actions #24

Updated by szarate almost 8 years ago

As we now have qemu fixed, I think it's a good time to add this. I might work on it this week.

Actions #25

Updated by okurz almost 8 years ago

  • Related to action #12836: preserve disk image / virtual machine / keep them running in case of failures on demand added
Actions #26

Updated by okurz almost 8 years ago

  • Checklist item changed from to [x] Have the WebUI register and handle Memory dumps, disk files so that the gru can run cleanups when needed
Actions #27

Updated by RBrownSUSE almost 8 years ago

  • Subject changed from Gather more system information and logs in case of boot/reboot times out to [tools]Gather more system information and logs in case of boot/reboot times out
Actions #28

Updated by dzedro almost 8 years ago

Memory dump died, DIE Migration failed: desc: There's a migration process in progress, class: GenericError, stopped at /usr/lib/os-autoinst/backend/qemu.pm line 169.
https://openqa.suse.de/tests/815304
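
The error above suggests that a dump was requested while an earlier migration was still considered to be running. One possible guard is to wait until no migration is in flight before starting a new one. The sketch below is an illustration and not what os-autoinst does: query_status is a hypothetical callable that returns the current migration state as a string, and the status names are assumed to follow the ones QEMU reports ("setup", "active", "completed", "failed", "cancelled").

```python
import time

# Assumption: states in which a migration is still in flight.
BUSY_STATES = {"setup", "active"}


def wait_until_idle(query_status, timeout=60.0, interval=1.0,
                    sleep=time.sleep, clock=time.monotonic):
    """Poll until no migration is running, then return the last state.

    Raises TimeoutError if a migration is still running at the deadline.
    """
    deadline = clock() + timeout
    while True:
        status = query_status()
        if status not in BUSY_STATES:
            return status
        if clock() >= deadline:
            raise TimeoutError(f"migration still '{status}' after {timeout}s")
        sleep(interval)
```

A caller would run wait_until_idle before triggering the memory dump and treat a TimeoutError as "skip the dump" rather than letting the whole test die.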

Actions #29

Updated by okurz over 7 years ago

  • Related to action #16520: [qam][opensuse][sle][functional] enhance logging and debugging in case of failed shutdown, e.g. press 'esc' on plymouth splash screen added
Actions #30

Updated by okurz over 6 years ago

  • Related to action #34609: [sle][functional][u][medium] Improve Implementation of workaround for bsc#1083646 and debug output in reconnect_s390 on S390-KVM added
Actions #31

Updated by okurz over 6 years ago

With https://github.com/os-autoinst/os-autoinst-distri-opensuse/pull/4843 I added a call to help investigate bootup problems on s390x z/VM by sending magic-sysrq-w to show "blocked tasks". This can help in case we cannot even log in. It might be limited to some platforms where the rescue mode is not available or not working. Still, for automatic investigation this can be helpful.

Actions #32

Updated by szarate over 6 years ago

  • Assignee deleted (szarate)
Actions #33

Updated by szarate over 6 years ago

@okurz: I wonder if poo#36601 is needed at all

Actions #34

Updated by okurz over 6 years ago

Well, I think the whole memory dump feature is useless as long as we do not use it on a regular basis, e.g. see the post_fail_hook of tests/installation/first_boot.pm in os-autoinst-distri-opensuse referencing https://progress.opensuse.org/issues/19390 and such. I think this could be revisited. Also, a command to respawn the VM, shown to a bug assignee one way or another, would be helpful. Of course, it should not just output the qemu command line from the autoinst-log.txt but also load a memory dump.

Actions #35

Updated by okurz almost 5 years ago

  • Subject changed from [tools]Gather more system information and logs in case of boot/reboot times out to [tools] Gather more system information and logs in case of boot/reboot times out
  • Status changed from Feedback to Resolved
  • Assignee set to okurz

I guess it's OK if we keep the subticket #36601 (in the parent project) open and close this "test related" issue, especially as we have gained even better logs and information since then, e.g. magic sysrq on unresponsive systems.
