action #104106 (closed)

[qe-core] test fails in await_install - Network performance for ppc installations is decreasing size:S

Added by szarate almost 3 years ago. Updated almost 3 years ago.

Status: Resolved
Priority: Urgent
Assignee: mkittler
Category: -
Start date: 2021-12-16
Due date: -
% Done: 0%
Estimated time: -

Description

Observation

Long story short: since a few builds ago, await_install has been failing sporadically, but works after retriggering the test... Looking a bit closer, the test seems to be hitting the ~30 minute timeout threshold while still making progress, falling just a few minutes short of finishing...

openQA test in scenario sle-15-SP4-Online-ppc64le-create_hdd_gnome@ppc64le-2g fails in await_install

Test suite description

Image creation job used as a parent for other jobs that test based on an existing installation. To be used as START_AFTER_TEST=create_hdd_gnome

Reproducible

Fails since (at least) Build 74.1 (current job)

Expected result

Last good: 70.1 (or more recent)

Further details

Always latest result in this scenario: latest

Suggestions

  • Increase TIMEOUT_SCALE, and also MAX_SETUP_TIME (see the sketch below)
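
A minimal sketch of what such settings could look like on the affected test suite or machine. TIMEOUT_SCALE and MAX_SETUP_TIME are existing openQA/os-autoinst settings, but the concrete values below are illustrative assumptions, not the values that were actually applied:

    # illustrative values only
    TIMEOUT_SCALE=2        # multiply module timeouts, e.g. await_install's ~30 min limit
    MAX_SETUP_TIME=7200    # allow more time (in seconds) for the worker setup/caching phase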

Related issues 1 (0 open, 1 closed)

Related to openQA Infrastructure (public) - coordination #102882: [epic] All OSD PPC64LE workers except malbec appear to have horribly broken cache service (Resolved, kraih, 2022-02-10)

Actions #1

Updated by maritawerner almost 3 years ago

@szarate a ticket for the qe tools team? for the power backend?

Actions #2

Updated by szarate almost 3 years ago

  • Project changed from openQA Tests (public) to openQA Infrastructure (public)
  • Subject changed from test fails in await_install - Network performance for ppc installations is decreasing to [qe-core] test fails in await_install - Network performance for ppc installations is decreasing
  • Description updated (diff)
  • Category deleted (Bugs in existing tests)

Created https://github.com/os-autoinst/os-autoinst-distri-opensuse/pull/13885 for the time being. I'll keep an eye on the problem to see how this behaves in the future, but likely somebody will have to take a look to figure out where the network bottleneck is...

Adding Oliver as a watcher to see if he has any suggestions

Actions #3

Updated by szarate almost 3 years ago

@maritawerner wrote:

@szarate a ticket for the qe tools team? for the power backend?

Whoever has to take care of the infrastructure and network... For now I'm leaving it within qe-core, but in the infrastructure project, as this has little to do with the tests and more to do with the hardware per se.

Actions #4

Updated by okurz almost 3 years ago

  • Target version set to Ready

Please be aware of #102882. As long as we have that degraded network performance affecting most of our PPC workers, we can run fewer tests in parallel and have to wait longer, e.g. scale timeouts. Just today I fixed the problem that instances that should have stayed disabled reappeared. Likely mkittler only stopped the instances, but after a reboot they came back. So consider any test results on PPC machines from the last day affected by that. I added the ticket to our backlog, as we should at least scale timeouts in the host-specific configs (a simplified sketch of the timeout-scaling idea follows below).
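
For context, a simplified sketch of what timeout scaling means here; this is only an illustration of the idea, not the actual os-autoinst implementation or the actual config change:

    # conceptual sketch: waits are multiplied by TIMEOUT_SCALE
    effective_timeout = base_timeout * TIMEOUT_SCALE
    # e.g. a ~30 min await_install limit with TIMEOUT_SCALE=2 becomes ~60 min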

Actions #5

Updated by okurz almost 3 years ago

  • Priority changed from Normal to Urgent
Actions #6

Updated by szarate almost 3 years ago

  • Related to coordination #102882: [epic] All OSD PPC64LE workers except malbec appear to have horribly broken cache service added
Actions #7

Updated by livdywan almost 3 years ago

  • Subject changed from [qe-core] test fails in await_install - Network performance for ppc installations is decreasing to [qe-core] test fails in await_install - Network performance for ppc installations is decreasing size:S
  • Description updated (diff)
  • Status changed from New to Workable
  • Assignee set to mkittler
Actions #8

Updated by mkittler almost 3 years ago

  • Status changed from Workable to Feedback

SR to work around the slow network performance: https://gitlab.suse.de/openqa/salt-pillars-openqa/-/merge_requests/375

Actions #9

Updated by okurz almost 3 years ago

  • Status changed from Feedback to Resolved

MR was merged and the change to the worker config is effective. In the scenario from the original description, the last job within "latest" was from four days ago, i.e. before the worker config changes, but this should suffice for the ticket at hand. There is still #102882 about the real issue. Anyone feel free to use auto-review with a regex in #102882 to automatically catch and retrigger such failures until a proper solution is found for the root cause of the degraded network performance (a hedged example is sketched below).
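
A minimal sketch of how that could look, assuming the usual convention of adding an auto_review marker to the subject of #102882 (the marker syntax comes from the os-autoinst/scripts auto-review tooling; the regex itself is a placeholder here, not an actual failure message from these jobs):

    [epic] All OSD PPC64LE workers ... auto_review:"<regex matching the timeout/failure reason>":retry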

Actions #10

Updated by openqa_review almost 3 years ago

This is an autogenerated message for openQA integration by the openqa_review script:

This bug is still referenced in a failing openQA test: create_hdd_gnome@ppc64le-2g
https://openqa.suse.de/tests/8158017

To prevent further reminder comments one of the following options should be followed:

  1. The test scenario is fixed by applying the bug fix to the tested product or the test is adjusted
  2. The openQA job group is moved to "Released" or "EOL" (End-of-Life)
  3. The bugref in the openQA scenario is removed or replaced, e.g. label:wontfix:boo1234