action #27004

[opensuse][sle][functional][yast][y][hard] yast2 gui modules fail to start in the defined time frame

Added by riafarov over 2 years ago. Updated 11 months ago.

Status: Rejected
Start date: 25/10/2017
Priority: Normal
Due date:
Assignee: riafarov
% Done: 0%
Category: Bugs in existing tests
Target version: QA - future
Difficulty: hard
Duration:

Description

In the yast2_gui test suite we see random failures, most of which occur because a yast module fails to start in time.
The default timeout is 60 seconds, which is not always enough. We also tried to bump all timeouts using the TIMEOUT_SCALE variable, which helped but did not resolve the problem.
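
To illustrate how such a scale factor interacts with the 60 second default, here is a minimal Python sketch. It assumes TIMEOUT_SCALE acts as a plain multiplier on each timeout; the function name `effective_timeout` and the `settings` dictionary are hypothetical and are not taken from os-autoinst.

```python
DEFAULT_TIMEOUT = 60  # seconds, the default mentioned in the description


def effective_timeout(base=DEFAULT_TIMEOUT, settings=None):
    """Return the timeout after applying TIMEOUT_SCALE.

    Assumption: TIMEOUT_SCALE is a simple multiplier; a missing value means 1.
    """
    settings = settings or {}
    scale = float(settings.get("TIMEOUT_SCALE", 1))
    if scale <= 0:
        raise ValueError("TIMEOUT_SCALE must be positive")
    return base * scale
```

Under this assumption a scale of 3 gives a module 180 seconds to start instead of 60, which makes slow starts less visible without explaining them.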

Hence, we need to investigate what causes it.

One of the causes is that btrfs balancing is triggered, which significantly degrades system performance; this can be reproduced manually as well. So we could trigger balancing before running the test suite.
Another idea is to revert to the last good snapshot even if the run was successful, but this would require changes in os-autoinst, which may take longer to implement.

Observation

openQA test in scenario opensuse-Tumbleweed-DVD-x86_64-yast2_gui@64bit fails in
yast2_firewall

Reproducible

Fails since (at least) Build 20171023

Expected result

Last good: 20171022 (or more recent)

Acceptance criteria

  • AC1: yast modules start up reliably within yast2_gui with no TIMEOUT_SCALE applied
  • AC2: No TIMEOUT_SCALE is set on the test suite on either osd or o3

Further details

Always latest result in this scenario: latest


Related issues

Related to openQA Tests - action #25634: [sle][functional][opensuse][sporadic]test fails in yast2_... Resolved 28/09/2017 08/11/2017
Related to openQA Tests - action #26104: test fails in yast2_lang Resolved 17/10/2017

History

#1 Updated by okurz over 2 years ago

  • Due date set to 22/11/2017

#2 Updated by riafarov over 2 years ago

  • Related to action #25634: [sle][functional][opensuse][sporadic]test fails in yast2_firewall because "zypper in" times out -> bump timeout added

#3 Updated by riafarov over 2 years ago

#4 Updated by okurz over 2 years ago

  • Target version set to Milestone 14

#5 Updated by okurz over 2 years ago

  • Due date deleted (22/11/2017)

#6 Updated by riafarov over 2 years ago

The experiment with killing yast processes didn't work well with the firewall yast module, as we switch to the root console twice to install the apache module: https://openqa.opensuse.org/tests/538463#step/yast2_firewall/5
It then misbehaves and times out in some runs. So this is another reason to look for a different solution, e.g. rolling back to the last good snapshot even when the test module was successful (currently missing in openQA) or using YCP here (which may not resolve the issues with unreliable start). Another point to check is btrfs balancing, as it's easy to see that system performance degrades during rebalancing, which results in slowly starting modules. To address that, we can run balancing before test execution.
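
As a rough illustration of running the balance up front, the sketch below builds and runs a `btrfs balance start` command before any test step. It is only a sketch under assumptions: the function names are hypothetical, the 50% data usage filter is an arbitrary example value, and in openQA this would live in a Perl test module rather than in Python.

```python
import subprocess


def build_balance_command(mountpoint="/", data_usage=50):
    """Build a `btrfs balance start` command.

    The -dusage filter restricts the balance to data chunks below the given
    usage percentage; 50 is an assumed example value, not a recommendation.
    """
    if not 0 <= data_usage <= 100:
        raise ValueError("data_usage must be a percentage between 0 and 100")
    return ["btrfs", "balance", "start", f"-dusage={data_usage}", mountpoint]


def run_balance(mountpoint="/", data_usage=50):
    """Run the balance in the foreground so it has finished before tests start.

    Requires root privileges and a btrfs filesystem at `mountpoint`.
    """
    subprocess.run(build_balance_command(mountpoint, data_usage), check=True)
```

Running the balance in the foreground is the point of the sketch: the call returns only once balancing has finished, so it cannot compete for IO with a yast module that starts later.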

#7 Updated by okurz over 2 years ago

  • Status changed from New to In Progress
  • Assignee set to okurz

I wondered why the btrfs balancing hasn't been run before. I thought we did that as well for the "create_hdd" parent job. But as it turns out, the two schedule definition files products/{opensuse,sle}/main.pm differ on this point: sle calls console/force_cron_run, opensuse does not :( That could explain a lot. So IMHO the way to go is to make sure we call console/force_cron_run for opensuse as well, but this time the correct way, DRY in main.pm

#9 Updated by okurz over 2 years ago

  • Assignee deleted (okurz)

Merged. Verification on production, no new fails. https://openqa.opensuse.org/tests/539692# is the first yast2_gui test run in production after force_cron_run. At least it did not fail, but I guess it's too early to consider any issue resolved, especially when we still have that timeout scale on the job. So next step: test for proper statistics, remove TIMEOUT_SCALE and such

#10 Updated by riafarov about 2 years ago

  • Status changed from In Progress to New

#11 Updated by okurz about 2 years ago

  • Subject changed from [opensuse][sle][functional] yast2 gui modules fail to start in the defined time frame to [opensuse][sle][functional][yast][hard] yast2 gui modules fail to start in the defined time frame
  • Description updated (diff)
  • Due date set to 27/03/2018
  • Status changed from New to Workable
  • Target version changed from Milestone 14 to Milestone 15

Well, I think "New" is an understatement making people believe no one ever worked on this ticket.

Considering the suggestion from #27004#note-9 it should be "workable" but hard. Also added ACs.

#12 Updated by okurz about 2 years ago

  • Due date deleted (27/03/2018)
  • Target version changed from Milestone 15 to Milestone 17

no capacity in M15 or M16 left

#13 Updated by okurz about 2 years ago

  • Due date set to 10/04/2018
  • Target version changed from Milestone 17 to Milestone 15

Actually it seems we do have some [yast] specific capacity. Adding to S14.

#14 Updated by cwh about 2 years ago

  • Difficulty set to hard

#15 Updated by okurz about 2 years ago

  • Subject changed from [opensuse][sle][functional][yast][hard] yast2 gui modules fail to start in the defined time frame to [opensuse][sle][functional][yast][y][hard] yast2 gui modules fail to start in the defined time frame
  • Due date deleted (10/04/2018)
  • Target version changed from Milestone 15 to Milestone 17

nope, we were wrong, back to previous.

#16 Updated by okurz almost 2 years ago

  • Target version changed from Milestone 17 to Milestone 19

#17 Updated by okurz almost 2 years ago

  • Target version changed from Milestone 19 to Milestone 19

#18 Updated by okurz over 1 year ago

  • Related to action #39719: [saga][epic] Detect "known failures" and mark jobs as such added

#19 Updated by okurz over 1 year ago

  • Related to deleted (action #39719: [saga][epic] Detect "known failures" and mark jobs as such)

#20 Updated by okurz over 1 year ago

  • Blocked by action #39719: [saga][epic] Detect "known failures" and mark jobs as such added

#21 Updated by okurz over 1 year ago

  • Status changed from Workable to Blocked
  • Assignee set to okurz
  • Target version changed from Milestone 19 to future

Well, on o3 we still run yast2_gui with TIMEOUT_SCALE=5 and yast2_ncurses with TIMEOUT_SCALE=3, so this issue is still valid. By now we have improved the force_scheduled_tasks module so btrfs maintenance tasks are not triggered in the background anymore. However, other processes, e.g. zypper, can still trigger IO-heavy tasks, e.g. handling snapper snapshots, which might have a bad effect. We are trying to handle this better by detecting "known bugs", but we cannot yet easily turn a job into a soft-fail based on detecting a "known issue". So IMHO we should work on #39719 first before going further with this ticket.
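
The idea of turning a failure into a soft-fail once a "known issue" is detected could look roughly like the following Python sketch. Everything in it is hypothetical: the pattern, the log message format, the label, and the function name are made up for illustration and do not reflect how openQA or #39719 implement this.

```python
import re

# Hypothetical patterns; real ones would be derived from tracked bug reports.
KNOWN_ISSUES = {
    "yast-module-start-timeout": re.compile(
        r"yast2 \S+ .*did not start within \d+ ?s"
    ),
}


def classify_failure(log_text, known_issues=KNOWN_ISSUES):
    """Return ("softfail", label) if a known issue matches, else ("fail", None)."""
    for label, pattern in known_issues.items():
        if pattern.search(log_text):
            return "softfail", label
    return "fail", None
```

A failure whose log matches a known pattern would be reported as a soft-fail with the matching label, while anything unrecognised stays a hard failure and still needs review.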

#22 Updated by okurz 11 months ago

  • Assignee changed from okurz to riafarov

Moved to the new QSF-y PO after I moved to the "tools" team. I mainly checked the subject line, so in individual cases you might not agree to take the ticket over completely into QSF-y. Feel free to reassign to me or someone else in this case. Thanks.

#23 Updated by riafarov 11 months ago

  • Blocked by deleted (action #39719: [saga][epic] Detect "known failures" and mark jobs as such)

#24 Updated by riafarov 11 months ago

  • Status changed from Blocked to Rejected

I guess it's time to reject this one, as we have provided multiple mitigations and made progress in changing the way we test yast modules too.
