action #103791
closed
After module failure, the console is broken size:M
Description
Observation
I have observed situations where, after a module fails and openQA runs the next one, the very first command fails or times out.
Example:
In this job, after the docker_compose failure, the following modules fail right at the beginning.
More occurrences:
https://openqa.suse.de/tests/7806630#step/libvorbis/4
https://openqa.suse.de/tests/7810193#step/verify_default_target/4
Related Slack thread: https://suse.slack.com/archives/C02CANHLANP/p1639048127294800
Acceptance criteria
- AC1: Better information exists about the state of the system after loading snapshots or in case of failures
Suggestions
- Add a box to the test module stating that a snapshot was loaded and that, since the previous module failed, this might affect the result, e.g. due to I/O or the system clock being askew
- After loading snapshots in os-autoinst, use QEMU monitoring commands to find out whether the system is just busy/slow, see https://qemu-project.gitlab.io/qemu/system/monitor.html , e.g. "info status", and check whether the system is just very busy or still responsive. Other commands okurz recommends: "info migrate", as we have just loaded a snapshot before, so maybe it has not completely finished? Maybe "info dirty_rate" shows whether stuff needs to be handled before the system is properly responsive again?
- The output of those commands could be used in simple debug log lines, so nothing more fancy required
- Try to reproduce with a synthetic setup, could e.g. be part of the os-autoinst full-stack test
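The "simple debug log lines" idea from the suggestions above could look roughly like the following. This is only a sketch in Python, not os-autoinst code: the helper names are hypothetical, the sample replies are illustrative, and how the monitor replies are actually obtained from QEMU is left out entirely. It merely shows parsing an "info status" reply and turning raw replies into log lines.

```python
import re

# Commands named in the suggestions above; everything else here is hypothetical.
MONITOR_COMMANDS = ("info status", "info migrate", "info dirty_rate")


def parse_vm_status(reply):
    """Extract the state from an "info status" reply like "VM status: running".

    Returns a dict with "state" and an optional "detail" (the text in
    parentheses, if any), or None if the reply does not look like a status line.
    """
    match = re.search(r"VM status:\s*(\S+)(?:\s*\((.+?)\))?", reply)
    if match is None:
        return None
    return {"state": match.group(1), "detail": match.group(2)}


def debug_lines(replies):
    """Turn raw monitor replies into one debug log line per non-empty line.

    `replies` maps a monitor command to its raw text reply. Commands without
    a reply are logged as such, so a missing answer is visible in the log too.
    """
    lines = []
    for command in MONITOR_COMMANDS:
        reply = replies.get(command)
        if reply is None:
            lines.append(f"[debug] {command}: no reply")
            continue
        for raw in reply.splitlines():
            if raw.strip():
                lines.append(f"[debug] {command}: {raw.strip()}")
    return lines
```

Logging the raw lines keeps this independent of the exact output format of "info migrate" and "info dirty_rate", which the sketch deliberately does not try to interpret.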
Updated by jlausuch about 3 years ago
- Related to action #101295: [timebox: 8h][sporadic] test fails in verify_default_target added
Updated by maritawerner about 3 years ago
@jlausuch is that a ticket for the yast team? Or more for the QE Core team? Or both?
Updated by oorlov about 3 years ago
Marita, it looks like it is a ticket for qe-tools, as this is related to openQA itself. It is not something related to test code.
Updated by jlausuch about 3 years ago
Exactly, it has nothing to do with Yast, as it looks like something related to the openQA backend.
Updated by jlausuch about 3 years ago
Another example: https://openqa.suse.de/tests/7820905#step/docker_firewall/5
Updated by mkittler about 3 years ago
That's a recent regression, right? The first commit that comes to my mind is https://github.com/os-autoinst/os-autoinst/commit/d5eb330962dc9f13230af29e05eea7cefebd3124 as it affects failing jobs specifically.
Updated by jlausuch about 3 years ago
mkittler wrote:
That's a recent regression, right? The first commit that comes to my mind is https://github.com/os-autoinst/os-autoinst/commit/d5eb330962dc9f13230af29e05eea7cefebd3124 as it affects failing jobs specifically.
Yes, I've been noticing failures like this only recently.
Updated by jlausuch about 3 years ago
Another failure that might be related:
https://openqa.suse.de/tests/7832724#step/btrfs_send_receive/2
Here, after snapper_cleanup fails, the next modules fail due to Failed to wait for login prompt.
Updated by maritawerner about 3 years ago
- Subject changed from After module failure, the console is broken to [qe-core] After module failure, the console is broken
Updated by maritawerner about 3 years ago
- Subject changed from [qe-core] After module failure, the console is broken to After module failure, the console is broken
Updated by maritawerner about 3 years ago
- Project changed from openQA Tests (public) to openQA Project (public)
- Category deleted (Bugs in existing tests)
Updated by okurz about 3 years ago
- Category set to Regressions/Crashes
- Target version set to Ready
Updated by mkittler about 3 years ago
- Subject changed from After module failure, the console is broken to After module failure, the console is broken size:M
- Description updated (diff)
- Status changed from New to Workable
Updated by jlausuch about 3 years ago
Another example that might be related: https://openqa.suse.de/tests/7871222#step/btrfs_qgroups/2
Updated by mkittler almost 3 years ago
After loading snapshots in os-autoinst use QEMU monitoring commands …
Note that https://openqa.suse.de/tests/7806676#step/cifs/4 (the first job mentioned in the ticket description) is actually not using the QEMU backend. The svirt backend which is used here also supports snapshots, so it could still be a performance problem when loading snapshots. However, QEMU commands won't always help.
Apparently, the problem can be reproduced quite reliably: https://openqa.suse.de/tests/7924937#next_previous - There was not even a single job in that recent history that was not affected.
I don't understand what the problem is in the libvorbis example, as there's just a single failing module.
Updated by openqa_review almost 3 years ago
This is an autogenerated message for openQA integration by the openqa_review script:
This bug is still referenced in a failing openQA test: minimal+role_minimal
https://openqa.suse.de/tests/8029475
To prevent further reminder comments one of the following options should be followed:
- The test scenario is fixed by applying the bug fix to the tested product or the test is adjusted
- The openQA job group is moved to "Released" or "EOL" (End-of-Life)
- The bugref in the openQA scenario is removed or replaced, e.g. label:wontfix:boo1234
Updated by okurz almost 3 years ago
- Priority changed from Normal to High
Treating as high due to the reminder comment, so likely someone is waiting for this ticket to be resolved.
Updated by livdywan almost 3 years ago
- Description updated (diff)
Discussed the ticket after the daily, and came up with a suggestion to visualize the snapshotting in the test module execution
Updated by livdywan almost 3 years ago
- Status changed from Workable to In Progress
- Assignee set to livdywan
Asked for some pointers in the daily, as I wasn't sure whether to extend on the JS or Perl side of openQA, and received the suggestion to really solve it in os-autoinst without special-casing.
Updated by openqa_review almost 3 years ago
- Due date set to 2022-03-02
Setting due date based on mean cycle time of SUSE QE Tools
Updated by livdywan almost 3 years ago
I prepared a draft against os-autoinst, confirming the virtual lack of test coverage related to snapshots to start with. Next step adding my proposed fix, but also tests because I'm into TDD
Updated by livdywan almost 3 years ago
cdywan wrote:
I prepared a draft against os-autoinst, confirming the virtual lack of test coverage related to snapshots to start with. Next step adding my proposed fix, but also tests because I'm into TDD
- I updated the original draft to implement snapshot visualization via record info files, and added autotest coverage.
- Since I kept wading through deprecation messages I also proposed an orthogonal fix for that https://github.com/os-autoinst/os-autoinst/pull/1965
- I prepared another branch which adds coverage for qemu-based snapshot logging as part of the fullstack test: https://github.com/os-autoinst/os-autoinst/pull/1966
- Yet another branch addresses svirt-specific snapshot features - this is not totally obvious but there's quite a bit of backend-specific code meaning two jobs using snapshots can fail very differently if something goes wrong: https://github.com/os-autoinst/os-autoinst/pull/1967
Only the first one is required to resolve this ticket, but as I mentioned I wanted to confirm gaps in coverage while I'm working on this since I had to disambiguate different issues for myself anyway and this ticket gets linked to very different jobs.
Updated by livdywan almost 3 years ago
cdywan wrote:
cdywan wrote:
I prepared a draft against os-autoinst, confirming the virtual lack of test coverage related to snapshots to start with. Next step adding my proposed fix, but also tests because I'm into TDD
- I updated the original draft to implement snapshot visualization via record info files, and added autotest coverage.
Apparently I'm hitting confusing behaviors where $current_test is not defined when the lastgood snapshot gets loaded.
Updated by livdywan almost 3 years ago
- Due date changed from 2022-03-02 to 2022-03-11
cdywan wrote:
cdywan wrote:
cdywan wrote:
I prepared a draft against os-autoinst, confirming the virtual lack of test coverage related to snapshots to start with. Next step adding my proposed fix, but also tests because I'm into TDD
- I updated the original draft to implement snapshot visualization via record info files, and added autotest coverage.
Apparently I'm hitting confusing behaviors where $current_test is not defined when the lastgood snapshot gets loaded.
Aiming to wrap this up next week, and getting some ideas from team members (to support our hackweek I didn't actively ask for help in the usual calls).
Updated by jlausuch almost 3 years ago
I don't see it too often in JeOS test, but today I noticed:
https://openqa.suse.de/tests/8285438#step/btrfs_qgroups/2
After a timeout failure in btrfs_autocompletion, all the following modules fail with
Test died: Failed to wait for login prompt at sle/lib/serial_terminal.pm line 114.
Updated by livdywan almost 3 years ago
New, cleaner approach to show the snapshot loading in tests https://github.com/os-autoinst/os-autoinst/pull/1987
Updated by livdywan almost 3 years ago
- Status changed from In Progress to Feedback
cdywan wrote:
New, cleaner approach to show the snapshot loading in tests https://github.com/os-autoinst/os-autoinst/pull/1987
Merged
Updated by okurz almost 3 years ago
I deployed the change on openqaworker7 and triggered a specific test job:
openqa-clone-job --skip-chained-deps --within-instance https://openqa.opensuse.org/t2236452 _GROUP=0 BUILD=okurz_poo103791 TEST=krypton-live-okurz_poo103791 EXCLUDE_MODULES=systemsettings5,dolphin,konsole,desktop_mainmenu,kontact,shutdown WORKER_CLASS=openqaworker7
This should fail in firefox (at least the job template failed), load a snapshot, show a test module result with the information that a snapshot was loaded, then fail again in kate and load a snapshot again, but not show a box because there is no next test module.
Created job #2236467: opensuse-Tumbleweed-Krypton-Live-x86_64-Build4.33-krypton-live@USBboot_64-2G -> https://openqa.opensuse.org/t2236467
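The expectation described for this job can be written down as a small model. This is a sketch only, assuming a deliberately simplified rule (a snapshot is loaded after every failed module and the note is attached to the module that runs next); the function name is hypothetical and the real os-autoinst logic, e.g. around milestones, may well differ.

```python
def modules_showing_snapshot_note(results):
    """Return the modules expected to show a "snapshot loaded" note.

    `results` is an ordered list of (module_name, passed) tuples.
    Simplified assumption: after each failed module a snapshot is loaded,
    and the note appears on the next module. A failure in the last module
    produces no note because there is no next module to attach it to.
    """
    noted = []
    # Pair every module with its successor; the last module has none.
    for (_name, passed), (next_name, _next_passed) in zip(results, results[1:]):
        if not passed:
            noted.append(next_name)
    return noted


# Module order as implied by the job above (other modules excluded).
job = [("firefox", False), ("firefox_audio", True), ("kate", False)]
print(modules_showing_snapshot_note(job))
```

Under this model only firefox_audio carries a note, while the failure in kate, being the last module, produces none.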
Updated by okurz almost 3 years ago
https://openqa.opensuse.org/tests/2236467#step/firefox_audio/1 shows the expected snapshot loading icon. Though I wonder why there is none in https://openqa.opensuse.org/tests/2236467#step/system_prepare/1 . I suggest to crosscheck the difference in algorithm for milestone and non-milestone.
Updated by livdywan almost 3 years ago
okurz wrote:
https://openqa.opensuse.org/tests/2236467#step/firefox_audio/1 shows the expected snapshot loading icon. Though I wonder why there is none in https://openqa.opensuse.org/tests/2236467#step/system_prepare/1 . I suggest to crosscheck the difference in algorithm for milestone and non-milestone.
This looks to me like (not) being in the same category makes the difference, although I can't back up that observation with code 🤔️
Updated by okurz almost 3 years ago
- Due date deleted (2022-03-11)
- Status changed from Feedback to Resolved
yeah, let's just pretend we have not seen this problem and call the problem done :)
For everyone, please keep in mind that this will not fix specific problems in test code that still need to be fixed individually. This is only making it more clear where a snapshot was loaded with the potential impact that can have.
Updated by jlausuch almost 3 years ago
- Related to action #108064: Test fails in btrfs_autocompletion - System management is locked by the application with pid 1658 (zypper). added
Updated by livdywan almost 3 years ago
jlausuch wrote:
https://openqa.suse.de/tests/8321241
https://openqa.suse.de/tests/8322911
https://openqa.suse.de/tests/8322915
All of these show the snapshot loading after btrfs_autocompletion failed, including subsequent modules, because testapi::select_console("root-virtio-terminal") gets stuck. So it looks to me like a consequence of #108064.