action #27004
closed[opensuse][sle][functional][yast][y][hard] yast2 gui modules fail to start in the defined time frame
0%
Description
In the yast2_gui test suite we have random failures, most of which appear to be due to a yast module not being able to start.
We have a default timeout of 60 seconds, which is not always enough. We also tried to bump all timeouts using the TIMEOUT_SCALE variable, which helped but has not resolved the problem.
Hence, we need to investigate what causes it.
One of the causes is that btrfs balancing is triggered, which significantly affects system performance; this issue can be replicated manually as well. So we could trigger balancing before running the test suite.
Another idea is to revert to the last good snapshot even if the run was successful, but this would require changes in os-autoinst, which may take longer to implement.
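To illustrate the timeout scaling mentioned above, here is a minimal sketch of the arithmetic, assuming that TIMEOUT_SCALE acts as a plain multiplier on each timeout. The function name `scale_timeout` and the constant are illustrative names for this sketch, not taken from this ticket; the factors 3 and 5 are the ones reported later in this ticket for o3.

```python
# Sketch only: assumes TIMEOUT_SCALE is a simple multiplier on the base timeout.
DEFAULT_MODULE_START_TIMEOUT = 60  # seconds, the default named in the description


def scale_timeout(base_seconds: float, timeout_scale: float = 1.0) -> float:
    """Return the effective timeout after applying the scale factor."""
    if base_seconds < 0:
        raise ValueError("base timeout must not be negative")
    if timeout_scale <= 0:
        raise ValueError("scale factor must be positive")
    return base_seconds * timeout_scale


# Show how long a module may take to start before the check gives up.
for scale in (1, 3, 5):
    effective = scale_timeout(DEFAULT_MODULE_START_TIMEOUT, scale)
    print(f"TIMEOUT_SCALE={scale}: wait up to {effective:.0f}s")
```

Under this assumption a larger factor only hides slow starts behind a longer wait; it does not remove whatever makes the module start slowly.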
Observation
openQA test in scenario opensuse-Tumbleweed-DVD-x86_64-yast2_gui@64bit fails in
yast2_firewall
Reproducible
Fails since (at least) Build 20171023
Expected result
Last good: 20171022 (or more recent)
Acceptance criteria
- AC1: yast modules start up reliably within yast2_gui with no TIMEOUT_SCALE applied
- AC2: No TIMEOUT_SCALE set on the test suite on either osd or o3
Further details
Always latest result in this scenario: latest
Updated by riafarov about 7 years ago
- Related to action #25634: [sle][functional][opensuse][sporadic]test fails in yast2_firewall because "zypper in" times out -> bump timeout added
Updated by riafarov about 7 years ago
- Related to action #26104: test fails in yast2_lang added
Updated by riafarov almost 7 years ago
The experiment with killing yast processes didn't work well with the firewall yast module, as we switch to the root console twice to install the apache module: https://openqa.opensuse.org/tests/538463#step/yast2_firewall/5
And then it doesn't work well and times out in some runs. So this is another reason to look for a different solution, e.g. rolling back to the last good snapshot even if the test module was successful (currently missing in openQA) or using YCP here (which may not resolve the issues with unreliable start). Another point to check is btrfs balancing, as it's easy to see that system performance degrades during rebalancing, which results in slowly starting modules. For that we can run balancing before test execution.
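The idea of running balancing before test execution could be sketched as follows. This is a hedged illustration, not what the test suite does: the helper names and the usage thresholds (50 for data, 30 for metadata) are assumptions chosen for the example. It relies on `btrfs balance start` running in the foreground by default, so the call returns only once balancing has finished. Running it for real requires root.

```python
import subprocess
from typing import List


def balance_command(mount_point: str = "/",
                    data_usage: int = 50,
                    metadata_usage: int = 30) -> List[str]:
    """Build a 'btrfs balance start' call with usage filters.

    The thresholds are illustrative assumptions, not values from the ticket.
    """
    return [
        "btrfs", "balance", "start",
        f"-dusage={data_usage}",      # only data chunks used up to this percentage
        f"-musage={metadata_usage}",  # only metadata chunks used up to this percentage
        mount_point,
    ]


def run_balance(mount_point: str = "/", dry_run: bool = True) -> List[str]:
    """Run the balance up front so it cannot start in the middle of a test.

    With dry_run=True the command is only returned, not executed.
    """
    cmd = balance_command(mount_point)
    if not dry_run:
        # Blocks until balancing has completed; raises on a non-zero exit code.
        subprocess.run(cmd, check=True)
    return cmd


print(" ".join(run_balance("/", dry_run=True)))
```

Doing the expensive I/O once, before the yast modules are started, is the point of the suggestion: the test then runs on a system that is not being rebalanced in the background.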
Updated by okurz almost 7 years ago
- Status changed from New to In Progress
- Assignee set to okurz
I wondered why the btrfs balancing hasn't been run before. I thought we do that as well for the "create_hdd" parent job. But as it turns out, the two schedule definition files products/{opensuse,sle}/main.pm differ on this point: sle calls console/force_cron_run, opensuse does not :( That could explain a lot. So IMHO the way to go is to make sure we call console/force_cron_run for opensuse as well, but this time the correct way: DRY in main.pm
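The DRY point can be illustrated with a small model, written in Python rather than in the Perl of the real main.pm files. It is only a sketch of the idea: the helper names and the `console/example_*_only` entries are hypothetical; only `console/force_cron_run` comes from this ticket.

```python
from typing import List


def load_common_console_tests(schedule: List[str]) -> None:
    """Shared helper: modules that every product should schedule."""
    schedule.append("console/force_cron_run")  # module named in this ticket


def build_schedule(product: str) -> List[str]:
    """Build the module list for one product from the shared helper."""
    schedule: List[str] = []
    load_common_console_tests(schedule)
    # Hypothetical product-specific entries, for illustration only.
    if product == "sle":
        schedule.append("console/example_sle_only")
    elif product == "opensuse":
        schedule.append("console/example_opensuse_only")
    else:
        raise ValueError(f"unknown product: {product}")
    return schedule


for product in ("opensuse", "sle"):
    print(product, build_schedule(product))
```

With the common part defined once, a module such as force_cron_run cannot be scheduled for one product and silently missed for the other.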
Updated by okurz almost 7 years ago
Updated by okurz almost 7 years ago
- Assignee deleted (okurz)
merged. Verification on production, no new fails. https://openqa.opensuse.org/tests/539692# is the first yast2_gui test run in production after force_cron_run. At least it did not fail, but I guess it's too early to consider the issue resolved, especially while we still have that timeout scale on the job. So next step: test for proper statistics, remove TIMEOUT_SCALE and such
Updated by okurz almost 7 years ago
- Subject changed from [opensuse][sle][functional] yast2 gui modules fail to start in the defined time frame to [opensuse][sle][functional][yast][hard] yast2 gui modules fail to start in the defined time frame
- Description updated (diff)
- Due date set to 2018-03-27
- Status changed from New to Workable
- Target version changed from Milestone 14 to Milestone 15
Well, I think "New" is an understatement making people believe no one ever worked on this ticket.
Considering the suggestion from #27004#note-9 it should be "workable" but hard. Also added ACs.
Updated by okurz over 6 years ago
- Due date deleted (2018-03-27)
- Target version changed from Milestone 15 to Milestone 17
no capacity in M15 or M16 left
Updated by okurz over 6 years ago
- Due date set to 2018-04-10
- Target version changed from Milestone 17 to Milestone 15
Actually it seems we do have some [yast] specific capacity. Adding to S14.
Updated by okurz over 6 years ago
- Subject changed from [opensuse][sle][functional][yast][hard] yast2 gui modules fail to start in the defined time frame to [opensuse][sle][functional][yast][y][hard] yast2 gui modules fail to start in the defined time frame
- Due date deleted (2018-04-10)
- Target version changed from Milestone 15 to Milestone 17
nope, we were wrong, back to previous.
Updated by okurz over 6 years ago
- Target version changed from Milestone 17 to Milestone 19
Updated by okurz over 6 years ago
- Target version changed from Milestone 19 to Milestone 19
Updated by okurz about 6 years ago
- Related to coordination #39719: [saga][epic] Detection of "known failures" for stable tests, easy test results review and easy tracking of known issues added
Updated by okurz about 6 years ago
- Related to deleted (coordination #39719: [saga][epic] Detection of "known failures" for stable tests, easy test results review and easy tracking of known issues)
Updated by okurz about 6 years ago
- Blocked by coordination #39719: [saga][epic] Detection of "known failures" for stable tests, easy test results review and easy tracking of known issues added
Updated by okurz about 6 years ago
- Status changed from Workable to Blocked
- Assignee set to okurz
- Target version changed from Milestone 19 to future
Well, on o3 we still run yast2_gui with TIMEOUT_SCALE=5 and yast2_ncurses with TIMEOUT_SCALE=3, so this issue is still valid. By now we have improved the force_scheduled_tasks module so that btrfs maintenance tasks are not triggered in the background anymore. However, other processes, e.g. zypper, can still trigger IO-heavy tasks, e.g. handling snapper snapshots, which might have a bad effect. We are trying to improve detection of "known bugs" to handle this issue better, but we cannot yet easily turn a job into a soft-fail based on detecting a "known issue". So IMHO we should work on #39719 first before going further with this ticket.
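What "turning a job into a soft-fail based on detecting a known issue" could look like is sketched below. This is purely illustrative and is not the design of #39719: the log message format, the pattern and the function name are all hypothetical.

```python
import re
from typing import Dict, Optional, Tuple

# Hypothetical mapping of failure patterns to ticket references.
# The message format matched here is an assumption made for this sketch.
KNOWN_ISSUES: Dict[str, str] = {
    r"yast2 \w+ did not start within \d+ seconds": "#27004",
}


def classify_failure(failure_text: str) -> Tuple[str, Optional[str]]:
    """Return ('softfail', ticket) for a known issue, else ('fail', None)."""
    for pattern, ticket in KNOWN_ISSUES.items():
        if re.search(pattern, failure_text):
            return "softfail", ticket
    return "fail", None


print(classify_failure("yast2 firewall did not start within 60 seconds"))
print(classify_failure("unexpected popup in installer"))
```

The design choice in this sketch is that only failures matching an explicitly listed pattern are downgraded, so anything unknown still fails and gets reviewed.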
Updated by okurz over 5 years ago
- Assignee changed from okurz to riafarov
Move to new QSF-y PO after I moved to the "tools"-team. I mainly checked the subject line so in individual instances you might not agree to take it over completely into QSF-y. Feel free to reassign to me or someone else in this case. Thanks.
Updated by riafarov over 5 years ago
- Blocked by deleted (coordination #39719: [saga][epic] Detection of "known failures" for stable tests, easy test results review and easy tracking of known issues)
Updated by riafarov over 5 years ago
- Status changed from Blocked to Rejected
I guess it's time to reject this one, as we have provided multiple mitigations and have also made progress in changing how we test yast modules.