action #103791

closed

After module failure, the console is broken size:M

Added by jlausuch over 2 years ago. Updated about 2 years ago.

Status:
Resolved
Priority:
High
Assignee:
Category:
Regressions/Crashes
Target version:
Start date:
2021-12-09
Due date:
% Done:

0%

Estimated time:

Description

Observation

I have observed situations where, after a module fails and openQA runs the next one, the very first command of that next module fails or times out.
Example:

In this job (https://openqa.suse.de/tests/7806676#step/cifs/4), after the docker_compose failure, the following modules fail right at the beginning.

More occurrences:
https://openqa.suse.de/tests/7806630#step/libvorbis/4
https://openqa.suse.de/tests/7810193#step/verify_default_target/4

Related Slack thread: https://suse.slack.com/archives/C02CANHLANP/p1639048127294800

Acceptance criteria

  • AC1: Better information exists about the state of the system after loading snapshots or in case of failures

Suggestions

  • Add a box to the test module stating that a snapshot was loaded because the previous module failed, and that this might affect the result, e.g. due to I/O load or the system clock being skewed
  • After loading snapshots in os-autoinst, use QEMU monitor commands (see https://qemu-project.gitlab.io/qemu/system/monitor.html ) to find out whether the system is merely busy or slow, e.g. "info status" to check whether the system is responsive or just very busy. okurz also recommends "info migrate": as a snapshot was loaded just before, maybe the load has not completely finished? Maybe "info dirty_rate" shows whether work remains before the system is properly responsive again?
  • The output of those commands could be used in simple debug log lines, so nothing fancier is required
  • Try to reproduce with a synthetic setup, could e.g. be part of the os-autoinst full-stack test
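
A minimal sketch of the "simple debug log line" idea, in Python for illustration only (os-autoinst itself is written in Perl). It assumes the monitor reply to "info status" looks like "VM status: running" or "VM status: paused (postmigrate)"; that format should be checked against the QEMU version in use. The helper name describe_vm_status and the log wording are hypothetical and not part of os-autoinst.

```python
import re

# Assumed shape of the HMP "info status" reply, e.g. "VM status: running"
# or "VM status: paused (postmigrate)". Verify against your QEMU version.
STATUS_RE = re.compile(r"VM status:\s*(\w+)(?:\s*\(([^)]+)\))?")


def describe_vm_status(monitor_output: str) -> str:
    """Turn raw monitor output into one debug log line (hypothetical helper)."""
    match = STATUS_RE.search(monitor_output)
    if match is None:
        return "snapshot loaded: VM status unknown (unparsed monitor output)"
    state, reason = match.group(1), match.group(2)
    detail = f"{state} ({reason})" if reason else state
    # Anything other than "running" suggests the guest cannot answer yet.
    hint = "" if state == "running" else "; guest may not respond yet"
    return f"snapshot loaded: VM status {detail}{hint}"


if __name__ == "__main__":
    print(describe_vm_status("VM status: running"))
    print(describe_vm_status("VM status: paused (postmigrate)"))
```

As noted later in the ticket, not every affected job runs on the QEMU backend, so a check like this would only cover part of the cases.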

Related issues 1 (0 open, 1 closed)

Related to qe-yam - action #101295: [timebox: 8h][sporadic] test fails in verify_default_target (Rejected, 2021-10-21)

Actions #1

Updated by jlausuch over 2 years ago

  • Related to action #101295: [timebox: 8h][sporadic] test fails in verify_default_target added
Actions #2

Updated by maritawerner over 2 years ago

@jlausuch is that a ticket for the yast team? Or more for the QE Core team? Or both?

Actions #3

Updated by oorlov over 2 years ago

Marita, it looks like it is a ticket for qe-tools, as this is related to openQA itself. It is not something related to test code.

Actions #4

Updated by jlausuch over 2 years ago

Exactly, it has nothing to do with YaST, as it looks like something related to the openQA backend.

Actions #6

Updated by mkittler over 2 years ago

That's a recent regression, right? The first commit that comes to my mind is https://github.com/os-autoinst/os-autoinst/commit/d5eb330962dc9f13230af29e05eea7cefebd3124 as it affects failing jobs specifically.

Actions #7

Updated by jlausuch over 2 years ago

mkittler wrote:

That's a recent regression, right? The first commit that comes to my mind is https://github.com/os-autoinst/os-autoinst/commit/d5eb330962dc9f13230af29e05eea7cefebd3124 as it affects failing jobs specifically.

Yes, I've been noticing failures like this only recently.

Actions #8

Updated by jlausuch over 2 years ago

Another failure that might be related:
https://openqa.suse.de/tests/7832724#step/btrfs_send_receive/2
Here, after snapper_cleanup fails, the next modules fail with "Failed to wait for login prompt".

Actions #9

Updated by maritawerner over 2 years ago

  • Subject changed from After module failure, the console is broken to [qe-core] After module failure, the console is broken
Actions #10

Updated by maritawerner over 2 years ago

  • Subject changed from [qe-core] After module failure, the console is broken to After module failure, the console is broken
Actions #11

Updated by maritawerner over 2 years ago

  • Project changed from openQA Tests to openQA Project
  • Category deleted (Bugs in existing tests)
Actions #12

Updated by okurz over 2 years ago

  • Category set to Regressions/Crashes
  • Target version set to Ready
Actions #13

Updated by mkittler over 2 years ago

  • Subject changed from After module failure, the console is broken to After module failure, the console is broken size:M
  • Description updated (diff)
  • Status changed from New to Workable
Actions #14

Updated by jlausuch over 2 years ago

Another example that might be related: https://openqa.suse.de/tests/7871222#step/btrfs_qgroups/2

Actions #15

Updated by mkittler over 2 years ago

After loading snapshots in os-autoinst use QEMU monitoring commands …

Note that https://openqa.suse.de/tests/7806676#step/cifs/4 (the first job mentioned in the ticket description) is actually not using the QEMU backend. The svirt backend used there also supports snapshots, so it could still be a performance problem when loading snapshots. However, QEMU commands won't always help.

Apparently, the problem can be reproduced quite reliably: https://openqa.suse.de/tests/7924937#next_previous - There was not even a single job in that recent history that was not affected.


I don't understand what the problem is in the libvorbis example, as there's just a single failing module.

Actions #16

Updated by openqa_review about 2 years ago

This is an autogenerated message for openQA integration by the openqa_review script:

This bug is still referenced in a failing openQA test: minimal+role_minimal
https://openqa.suse.de/tests/8029475

To prevent further reminder comments one of the following options should be followed:

  1. The test scenario is fixed by applying the bug fix to the tested product or the test is adjusted
  2. The openQA job group is moved to "Released" or "EOL" (End-of-Life)
  3. The bugref in the openQA scenario is removed or replaced, e.g. label:wontfix:boo1234
Actions #17

Updated by okurz about 2 years ago

  • Priority changed from Normal to High

Treating this as high priority because of the reminder comment; someone is likely waiting for this ticket to be resolved.

Actions #18

Updated by livdywan about 2 years ago

  • Description updated (diff)

Discussed the ticket after the daily, and came up with a suggestion to visualize the snapshotting in the test module execution

Actions #19

Updated by livdywan about 2 years ago

  • Status changed from Workable to In Progress
  • Assignee set to livdywan

Asked for some pointers in the daily, as I wasn't sure whether to extend the JS or the Perl side of openQA, and received the suggestion to really solve it in os-autoinst without special-casing.

Actions #20

Updated by openqa_review about 2 years ago

  • Due date set to 2022-03-02

Setting due date based on mean cycle time of SUSE QE Tools

Actions #21

Updated by livdywan about 2 years ago

I prepared a draft against os-autoinst, confirming the virtual lack of test coverage related to snapshots to start with. Next step adding my proposed fix, but also tests because I'm into TDD

Actions #22

Updated by livdywan about 2 years ago

cdywan wrote:

I prepared a draft against os-autoinst, confirming the virtual lack of test coverage related to snapshots to start with. Next step adding my proposed fix, but also tests because I'm into TDD

Only the first one is required to resolve this ticket, but as I mentioned I wanted to confirm gaps in coverage while I'm working on this since I had to disambiguate different issues for myself anyway and this ticket gets linked to very different jobs.

Actions #23

Updated by livdywan about 2 years ago

  • I updated the original draft to implement snapshot visualization via record info files, and added autotest coverage.

Apparently I'm hitting confusing behaviors where $current_test is not defined when the lastgood snapshot gets loaded.

Actions #24

Updated by livdywan about 2 years ago

  • Due date changed from 2022-03-02 to 2022-03-11

Aiming to wrap this up next week, and getting some ideas from team members (to support our hackweek I didn't actively ask for help in the usual calls).

Actions #25

Updated by jlausuch about 2 years ago

I don't see it too often in JeOS test, but today I noticed:
https://openqa.suse.de/tests/8285438#step/btrfs_qgroups/2

After a timeout failure in btrfs_autocompletion, all the following modules fail with:
Test died: Failed to wait for login prompt at sle/lib/serial_terminal.pm line 114.

Actions #26

Updated by livdywan about 2 years ago

New, cleaner approach to show the snapshot loading in tests https://github.com/os-autoinst/os-autoinst/pull/1987

Actions #27

Updated by livdywan about 2 years ago

  • Status changed from In Progress to Feedback

cdywan wrote:

New, cleaner approach to show the snapshot loading in tests https://github.com/os-autoinst/os-autoinst/pull/1987

Merged

Actions #28

Updated by okurz about 2 years ago

I deployed the change on openqaworker7 and triggered a specific test job:

openqa-clone-job --skip-chained-deps --within-instance https://openqa.opensuse.org/t2236452 _GROUP=0 BUILD=okurz_poo103791 TEST=krypton-live-okurz_poo103791 EXCLUDE_MODULES=systemsettings5,dolphin,konsole,desktop_mainmenu,kontact,shutdown WORKER_CLASS=openqaworker7

This should fail in firefox (at least the job template failed), load a snapshot, and show a test module result with the information that a snapshot was loaded. It should then fail again in kate and load a snapshot again, but not show a box because there is no next test module.

Created job #2236467: opensuse-Tumbleweed-Krypton-Live-x86_64-Build4.33-krypton-live@USBboot_64-2G -> https://openqa.opensuse.org/t2236467
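
The expected behaviour described above can be sketched as a small simulation. This is only an illustration in Python under stated assumptions, not the os-autoinst implementation: the function name annotate_snapshot_loads, the note text and the three-module order are hypothetical.

```python
from typing import Dict, List, Set


def annotate_snapshot_loads(modules: List[str], failing: Set[str]) -> Dict[str, List[str]]:
    """Return per-module notes; a note marks a module that starts from a loaded snapshot."""
    notes: Dict[str, List[str]] = {name: [] for name in modules}
    for index, name in enumerate(modules):
        if name not in failing:
            continue
        # A snapshot is loaded after every failure, but the notice is attached
        # to the *next* module, so after the last module there is nowhere to show it.
        if index + 1 < len(modules):
            notes[modules[index + 1]].append(f"snapshot loaded after {name} failed")
    return notes


if __name__ == "__main__":
    order = ["firefox", "firefox_audio", "kate"]  # illustrative module order
    for module, info in annotate_snapshot_loads(order, {"firefox", "kate"}).items():
        print(module, info)
```

With firefox and kate failing, only firefox_audio carries a note, which matches the expectation that the second snapshot load shows no box.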

Actions #29

Updated by okurz about 2 years ago

https://openqa.opensuse.org/tests/2236467#step/firefox_audio/1 shows the expected snapshot loading icon. Though I wonder why there is none in https://openqa.opensuse.org/tests/2236467#step/system_prepare/1 . I suggest to crosscheck the difference in algorithm for milestone and non-milestone.

Actions #30

Updated by livdywan about 2 years ago

okurz wrote:

https://openqa.opensuse.org/tests/2236467#step/firefox_audio/1 shows the expected snapshot loading icon. Though I wonder why there is none in https://openqa.opensuse.org/tests/2236467#step/system_prepare/1 . I suggest to crosscheck the difference in algorithm for milestone and non-milestone.

This looks to me like (not) being in the same category makes the difference, although I can't back up that observation with code 🤔️

Actions #31

Updated by okurz about 2 years ago

  • Due date deleted (2022-03-11)
  • Status changed from Feedback to Resolved

yeah, let's just pretend we have not seen this problem and call the problem done :)

For everyone: please keep in mind that this will not fix specific problems in test code, which still need to be fixed individually. It only makes it clearer where a snapshot was loaded and what potential impact that can have.

Actions #34

Updated by livdywan about 2 years ago

jlausuch wrote:

https://openqa.suse.de/tests/8321241
https://openqa.suse.de/tests/8322911
https://openqa.suse.de/tests/8322915

All of these show the snapshot loading after btrfs_autocompletion failed, including in subsequent modules, because testapi::select_console("root-virtio-terminal") gets stuck. So it looks to me like a consequence of #108064.
