action #103791: After module failure, the console is broken size:M - openQA Project (public) - openSUSE Project Management Tool

action #103791

closed

After module failure, the console is broken size:M

Added by jlausuch over 3 years ago. Updated about 3 years ago.

Status:

Resolved

Priority:

High

Assignee:

livdywan

Category:

Regressions/Crashes

Target version:

Ready

Start date:

2021-12-09

Due date:

% Done:

Estimated time:

Description

Observation¶

I have observed situations where, after a module fails and openQA runs the next one, the very first command fails or times out.
Example:

In this job, after the docker_compose failure, the subsequent modules fail right at the beginning.

More occurrences:
https://openqa.suse.de/tests/7806630#step/libvorbis/4
https://openqa.suse.de/tests/7810193#step/verify_default_target/4

Related Slack thread: https://suse.slack.com/archives/C02CANHLANP/p1639048127294800

Acceptance criteria¶

AC1: Better information exists about the state of the system after loading snapshots or in case of failures

Suggestions¶

Add a box to the test module stating that a snapshot was loaded and that, since the previous module failed, this might affect the result, e.g. due to I/O or the system clock being skewed
After loading snapshots in os-autoinst, use QEMU monitor commands to find out whether the system is just busy or slow (see https://qemu-project.gitlab.io/qemu/system/monitor.html), e.g. "info status" to check whether the system is responsive or merely very busy. okurz also recommends "info migrate": since we have just loaded a snapshot, maybe it has not completely finished? Maybe "info dirty_rate" shows whether work still needs to be handled before the system is properly responsive again?
The output of those commands could be used in simple debug log lines; nothing fancier is required
Try to reproduce with a synthetic setup, which could e.g. be part of the os-autoinst full-stack test
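
The monitor-command suggestion could be sketched roughly as follows. This is only an illustration in Python, not os-autoinst code: `send_hmp` is a hypothetical callable standing in for whatever monitor connection the backend has, the canned replies are made up for the demo, and the assumed "VM status: ..." reply format should be checked against the QEMU version in use.

```python
from typing import Callable, List

# Monitor commands named in the suggestions above.
DIAGNOSTIC_COMMANDS = ["info status", "info migrate", "info dirty_rate"]


def snapshot_debug_lines(send_hmp: Callable[[str], str]) -> List[str]:
    """Run each diagnostic command and format its reply as a debug log line.

    `send_hmp` is a hypothetical callable that sends one monitor command
    and returns the text reply.
    """
    lines = []
    for command in DIAGNOSTIC_COMMANDS:
        try:
            reply = send_hmp(command).strip()
        except Exception as error:  # an unresponsive monitor is worth logging too
            reply = f"<no reply: {error}>"
        first_line = reply.splitlines()[0] if reply else "<empty reply>"
        lines.append(f"[debug] after snapshot load: {command!r} -> {first_line}")
    return lines


def vm_is_running(status_reply: str) -> bool:
    """Interpret an `info status` reply, assumed to look like 'VM status: running'."""
    return status_reply.strip().lower().startswith("vm status: running")


# Illustrative canned replies; real monitor output may differ.
fake_replies = {
    "info status": "VM status: paused (restore-vm)",
    "info migrate": "",
}


def fake_monitor(command: str) -> str:
    """Stand-in monitor: answers known commands, times out on the rest."""
    if command not in fake_replies:
        raise TimeoutError("monitor did not answer")
    return fake_replies[command]


debug_lines = snapshot_debug_lines(fake_monitor)
for line in debug_lines:
    print(line)
```

Logging a failed or empty reply instead of raising keeps the diagnostics from turning into a new source of test failures, which matches the "simple debug log lines" idea.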

Related issues 2 (0 open — 2 closed)

Updated by jlausuch over 3 years ago

Related to action #101295: [timebox: 8h][sporadic] test fails in verify_default_target added

Updated by maritawerner over 3 years ago

@jlausuch is that a ticket for the yast team? Or more for the QE Core team? Or both?

Updated by oorlov over 3 years ago

Marita, it looks like it is a ticket for qe-tools, as this is related to openQA itself. It is not something related to test code.

Updated by jlausuch over 3 years ago

Exactly, it has nothing to do with YaST, as it looks like something related to the openQA backend.

Updated by jlausuch over 3 years ago

Another example: https://openqa.suse.de/tests/7820905#step/docker_firewall/5

Updated by mkittler over 3 years ago

That's a recent regression, right? The first commit that comes to my mind is https://github.com/os-autoinst/os-autoinst/commit/d5eb330962dc9f13230af29e05eea7cefebd3124 as it affects failing jobs specifically.

Updated by jlausuch over 3 years ago

mkittler wrote:

That's a recent regression, right? The first commit that comes to my mind is https://github.com/os-autoinst/os-autoinst/commit/d5eb330962dc9f13230af29e05eea7cefebd3124 as it affects failing jobs specifically.

Yes, I've been noticing failures like this only recently.

Updated by jlausuch over 3 years ago

Another failure that might be related:
https://openqa.suse.de/tests/7832724#step/btrfs_send_receive/2
Here, after snapper_cleanup fails, the next modules fail with "Failed to wait for login prompt".

Updated by maritawerner over 3 years ago

Subject changed from After module failure, the console is broken to [qe-core] After module failure, the console is broken

#10

Updated by maritawerner over 3 years ago

Subject changed from [qe-core] After module failure, the console is broken to After module failure, the console is broken

#11

Updated by maritawerner over 3 years ago

Project changed from openQA Tests (public) to openQA Project (public)
Category deleted (~~Bugs in existing tests~~)

#12

Updated by okurz over 3 years ago

Category set to Regressions/Crashes
Target version set to Ready

#13

Updated by mkittler over 3 years ago

Subject changed from After module failure, the console is broken to After module failure, the console is broken size:M
Description updated (diff)
Status changed from New to Workable

#14

Updated by jlausuch over 3 years ago

Another example that might be related: https://openqa.suse.de/tests/7871222#step/btrfs_qgroups/2

#15

Updated by mkittler over 3 years ago

After loading snapshots in os-autoinst use QEMU monitoring commands …

Note that https://openqa.suse.de/tests/7806676#step/cifs/4 (the first job mentioned in the ticket description) is actually not using the QEMU backend. The svirt backend, which is used here, also supports snapshots, so it could still be a performance problem when loading snapshots. However, QEMU commands won't always help.

Apparently, the problem can be reproduced quite reliably: https://openqa.suse.de/tests/7924937#next_previous - There was not even a single job in that recent history that was not affected.

I don't understand what's the problem in the libvorbis example as there's just a single failing module.

#16

Updated by openqa_review about 3 years ago

This is an autogenerated message for openQA integration by the openqa_review script:

This bug is still referenced in a failing openQA test: minimal+role_minimal
https://openqa.suse.de/tests/8029475

To prevent further reminder comments one of the following options should be followed:

The test scenario is fixed by applying the bug fix to the tested product or the test is adjusted
The openQA job group is moved to "Released" or "EOL" (End-of-Life)
The bugref in the openQA scenario is removed or replaced, e.g. label:wontfix:boo1234

#17

Updated by okurz about 3 years ago

Priority changed from Normal to High

Treating as high due to the reminder comment, so likely someone is waiting for this ticket to be resolved

#18

Updated by livdywan about 3 years ago

Description updated (diff)

Discussed the ticket after the daily, and came up with a suggestion to visualize the snapshotting in the test module execution

#19

Updated by livdywan about 3 years ago

Status changed from Workable to In Progress
Assignee set to livdywan

Asked for some pointers in the daily, as I wasn't sure whether to extend the JS or the Perl side of openQA, and received the suggestion to really solve it in os-autoinst without special-casing

#20

Updated by openqa_review about 3 years ago

Due date set to 2022-03-02

Setting due date based on mean cycle time of SUSE QE Tools

#21

Updated by livdywan about 3 years ago

I prepared a draft against os-autoinst, confirming the virtual lack of test coverage related to snapshots to start with. Next step adding my proposed fix, but also tests because I'm into TDD

#22

Updated by livdywan about 3 years ago

cdywan wrote:

I prepared a draft against os-autoinst, confirming the virtual lack of test coverage related to snapshots to start with. Next step adding my proposed fix, but also tests because I'm into TDD

I updated the original draft to implement snapshot visualization via record info files, and added autotest coverage.
Since I kept wading through deprecation messages I also proposed an orthogonal fix for that https://github.com/os-autoinst/os-autoinst/pull/1965
I prepared another branch which adds coverage for qemu-based snapshot logging as part of the fullstack test: https://github.com/os-autoinst/os-autoinst/pull/1966
Yet another branch addresses svirt-specific snapshot features - this is not totally obvious but there's quite a bit of backend-specific code meaning two jobs using snapshots can fail very differently if something goes wrong: https://github.com/os-autoinst/os-autoinst/pull/1967

Only the first one is required to resolve this ticket, but as I mentioned I wanted to confirm gaps in coverage while I'm working on this since I had to disambiguate different issues for myself anyway and this ticket gets linked to very different jobs.

#23

Updated by livdywan about 3 years ago

cdywan wrote:

cdywan wrote:

I prepared a draft against os-autoinst, confirming the virtual lack of test coverage related to snapshots to start with. Next step adding my proposed fix, but also tests because I'm into TDD

I updated the original draft to implement snapshot visualization via record info files, and added autotest coverage.

Apparently I'm hitting confusing behaviors where $current_test is not defined when the lastgood snapshot gets loaded.

#24

Updated by livdywan about 3 years ago

Due date changed from 2022-03-02 to 2022-03-11

cdywan wrote:

cdywan wrote:

cdywan wrote:

I prepared a draft against os-autoinst, confirming the virtual lack of test coverage related to snapshots to start with. Next step adding my proposed fix, but also tests because I'm into TDD

I updated the original draft to implement snapshot visualization via record info files, and added autotest coverage.

Apparently I'm hitting confusing behaviors where $current_test is not defined when the lastgood snapshot gets loaded.

Aiming to wrap this up next week, and getting some ideas from team members (to support our hackweek I didn't actively ask for help in the usual calls).

#25

Updated by jlausuch about 3 years ago

I don't see it too often in JeOS test, but today I noticed:
https://openqa.suse.de/tests/8285438#step/btrfs_qgroups/2

After a timeout failure in btrfs_autocompletion, all the following modules fail with
Test died: Failed to wait for login prompt at sle/lib/serial_terminal.pm line 114.

#26

Updated by livdywan about 3 years ago

New, cleaner approach to show the snapshot loading in tests https://github.com/os-autoinst/os-autoinst/pull/1987

#27

Updated by livdywan about 3 years ago

Status changed from In Progress to Feedback

cdywan wrote:

New, cleaner approach to show the snapshot loading in tests https://github.com/os-autoinst/os-autoinst/pull/1987

Merged

#28

Updated by okurz about 3 years ago

I deployed the change on openqaworker7 and triggered a specific test job:

openqa-clone-job --skip-chained-deps --within-instance https://openqa.opensuse.org/t2236452 _GROUP=0 BUILD=okurz_poo103791 TEST=krypton-live-okurz_poo103791 EXCLUDE_MODULES=systemsettings5,dolphin,konsole,desktop_mainmenu,kontact,shutdown WORKER_CLASS=openqaworker7

This should fail in firefox (at least the job template failed), load a snapshot, and show a test module result with the information that a snapshot was loaded. It should then fail again in kate and load a snapshot again, but not show a box because there is no next test module.

Created job #2236467: opensuse-Tumbleweed-Krypton-Live-x86_64-Build4.33-krypton-live@USBboot_64-2G -> https://openqa.opensuse.org/t2236467
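
The expectation described above can be written down as a small model. This is a deliberately simplified sketch and an assumption on my side, not the algorithm os-autoinst uses: it treats every failed module as triggering a snapshot load and ignores milestones and fatal failures. The function name and the module results are illustrative.

```python
from typing import Dict, List, Tuple


def annotate_snapshot_loads(results: List[Tuple[str, str]]) -> Dict[str, bool]:
    """Map each module name to whether it should show a 'snapshot loaded' box.

    Simplified model (an assumption, not the real algorithm): every failed
    module triggers a snapshot load, and the box is attached to the module
    that runs next, if there is one.
    """
    show_box = {name: False for name, _ in results}
    for index, (_name, result) in enumerate(results):
        has_next = index + 1 < len(results)
        if result == "failed" and has_next:
            next_name = results[index + 1][0]
            show_box[next_name] = True
    return show_box


# Illustrative module order and results, loosely following the job above.
job = [
    ("firefox", "failed"),
    ("firefox_audio", "passed"),
    ("kate", "failed"),
]
boxes = annotate_snapshot_loads(job)
for module, flagged in boxes.items():
    print(f"{module}: {'snapshot loaded' if flagged else '-'}")
```

In this model the failure in kate still loads a snapshot, but no box appears because no later module exists to carry it.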

#29

Updated by okurz about 3 years ago

https://openqa.opensuse.org/tests/2236467#step/firefox_audio/1 shows the expected snapshot loading icon. Though I wonder why there is none in https://openqa.opensuse.org/tests/2236467#step/system_prepare/1 . I suggest to crosscheck the difference in algorithm for milestone and non-milestone.

#30

Updated by livdywan about 3 years ago

okurz wrote:

https://openqa.opensuse.org/tests/2236467#step/firefox_audio/1 shows the expected snapshot loading icon. Though I wonder why there is none in https://openqa.opensuse.org/tests/2236467#step/system_prepare/1 . I suggest to crosscheck the difference in algorithm for milestone and non-milestone.

This looks to me like (not) being in the same category makes the difference, although I can't back up that observation with code 🤔️

#31

Updated by okurz about 3 years ago

Due date deleted (~~2022-03-11~~)
Status changed from Feedback to Resolved

yeah, let's just pretend we have not seen this problem and call the problem done :)

For everyone, please keep in mind that this will not fix specific problems in test code that still need to be fixed individually. This is only making it more clear where a snapshot was loaded with the potential impact that can have.

#32

Updated by jlausuch about 3 years ago

https://openqa.suse.de/tests/8321241
https://openqa.suse.de/tests/8322911
https://openqa.suse.de/tests/8322915

#33

Updated by jlausuch about 3 years ago

Related to action #108064: Test fails in btrfs_autocompletion - System management is locked by the application with pid 1658 (zypper). added

#34

Updated by livdywan about 3 years ago

jlausuch wrote:

https://openqa.suse.de/tests/8321241
https://openqa.suse.de/tests/8322911
https://openqa.suse.de/tests/8322915

All of these show the snapshot loading after btrfs_autocompletion failed, including in subsequent modules, because testapi::select_console("root-virtio-terminal") gets stuck. So it looks to me like a consequence of #108064.
