action #63724

closed

[functional][y][sporadic][fast] "Stall was detected" in accept license, only with full medium

Added by JRivrain about 4 years ago. Updated about 4 years ago.

Status: Resolved
Priority: Normal
Assignee:
Category: Bugs in existing tests
Target version: -
Start date: 2020-02-21
Due date: 2020-03-24
% Done: 0%
Estimated time:
Difficulty:

Description

Observation

We see very frequent failures in this module, especially in the latest build, and only with the full medium. We need to determine whether it is a product bug or an infrastructure problem. For Infra: please check whether some bottleneck is occurring.

openQA test in scenario sle-15-SP2-Full-aarch64-skip_registration@aarch64 fails in
accept_license

Test suite description

Maintainer: okurz, riafarov

Like a standard scenario with explicit skipping of SCC registration in case where we register by default, e.g. for SLE >= 15
See https://progress.opensuse.org/issues/25264 for details.

Reproducible

Fails since (at least) Build 139.1

Expected result

Last good: 136.2 (or more recent)

Further details

Always latest result in this scenario: latest

Actions #1

Updated by okurz about 4 years ago

I am at least sure there were no relevant explicit changes in the infrastructure, so I suggest you at least crosscheck the "last good build" to see whether it now also shows problems.

Actions #2

Updated by JRivrain about 4 years ago

I have seen this happen sporadically since build 128.1, but it seems to be much more frequent since build 139.1 (not even one successful run for that particular suite).

Actions #3

Updated by JRivrain about 4 years ago

That image would probably do: http://mirror.suse.cz/install/SLE-15-SP2-Full-Beta2/. However, I do not know how to upload it, and I do not have access to the production machines. The old jobs cannot simply be re-triggered because the assets are gone.

Actions #4

Updated by okurz about 4 years ago

  • Project changed from openQA Infrastructure to openQA Tests
  • Subject changed from "Stall was detected" in accept license, only with full medium to [functional][y][sporadic] "Stall was detected" in accept license, only with full medium
  • Category set to Bugs in existing tests

I'm sorry, I don't know how I can help you. I don't see a problem specific to the infrastructure. I strongly suggest you look into a structured test investigation to see what could be the source of the problem.

Actions #5

Updated by zluo about 4 years ago

I observed this issue on aarch64 worker:

https://openqa.suse.de/tests/3956245#step/accept_license/3

This is a hardware configuration issue on openqaworker-arm-2:8.

We need to reduce the number of workers in general, or take out the machine that causes trouble all the time.

Actions #6

Updated by okurz about 4 years ago

zluo wrote:

This is a hardware configuration issue on openqaworker-arm-2:8.
We need to reduce the number of workers in general, or take out the machine that causes trouble all the time.

This is a rather drastic and costly change that I don't support, as we are limited in aarch64 testing resources and many tests do not have a problem. It is clear that the three aarch64 machines we have within osd are not really stable or reliable, but they are the best we currently have. If you want to help, please provide proper measurements, e.g. compare openqaworker-arm-1 or openqaworker-arm-3, which both currently have fewer instances configured, against openqaworker-arm-2 and see if your hypothesis holds true. But we need proper statistics, e.g. fail rate mean ± std, not single jobs failing.
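A comparison like the one requested above could be sketched as follows. This is only a minimal illustration, not an existing openQA tool: the function name `fail_rate_by_host`, the input format (pairs of worker host and pass/fail) and the sample data are all assumptions made up for the example. In place of a mean ± std over repeated batches, it reports the fail rate per host together with the standard error of a binomial proportion.

```python
import math
from collections import defaultdict


def fail_rate_by_host(jobs):
    """Return {host: (total_jobs, fail_rate, std_error)}.

    `jobs` is an iterable of (worker_host, passed) pairs. This input
    format is an assumption for the sketch, not an openQA API.
    """
    counts = defaultdict(lambda: [0, 0])  # host -> [failed, total]
    for host, passed in jobs:
        counts[host][1] += 1
        if not passed:
            counts[host][0] += 1

    stats = {}
    for host, (failed, total) in counts.items():
        rate = failed / total
        # Standard error of a binomial proportion: sqrt(p * (1 - p) / n)
        std_error = math.sqrt(rate * (1.0 - rate) / total)
        stats[host] = (total, rate, std_error)
    return stats


# Invented sample data for illustration only, not real job results.
sample_jobs = (
    [("openqaworker-arm-1", True)] * 3
    + [("openqaworker-arm-1", False)]
    + [("openqaworker-arm-2", True)] * 2
    + [("openqaworker-arm-2", False)] * 2
)

for host, (total, rate, err) in sorted(fail_rate_by_host(sample_jobs).items()):
    print(f"{host}: n={total} fail rate={rate:.2f} +- {err:.2f}")
```

With only a handful of jobs per host the error bars overlap heavily, which is why single failing jobs are not enough to support the hypothesis.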

Actions #7

Updated by zluo about 4 years ago

@okurz:

... three aarch64 machines we have within osd are not really stable or reliable ...

Good to see that you confirm the performance issue, thanks.

How to handle this issue is another question. My suggestion is based on a lot of reviews on osd:

reduce the number of workers on openqaworker-arm-2 by 10% at first, or assign more RAM to each worker

review the results and see if this is sufficient.

take openqaworker-arm-1 and openqaworker-arm-3 out of osd because these machines don't meet the requirements for production.

Actions #8

Updated by okurz about 4 years ago

zluo wrote:

[…]
reduce the number of workers on openqaworker-arm-2 by 10% at first, or assign more RAM to each worker

review the results and see if this is sufficient.

take openqaworker-arm-1 and openqaworker-arm-3 out of osd because these machines don't meet the requirements for production.

Interesting, why do you think openqaworker-arm-1 and openqaworker-arm-3 are worse than openqaworker-arm-2? What is this based on?

Actions #9

Updated by zluo about 4 years ago

Because you told me that those machines are configured with only 4 workers and that they sometimes fail on osd. Please correct me if this is not true.

Actions #10

Updated by okurz about 4 years ago

zluo wrote:

Because you told me that those machines are configured with only 4 workers and that they sometimes fail on osd. Please correct me if this is not true.

It is correct that openqaworker-arm-1 and openqaworker-arm-3 each currently have fewer worker instances configured. The reason is not that they are worse than arm-2 but so that you can test and see if the number of instances matters. So far I have not seen any evidence backing this hypothesis.

Actions #11

Updated by zluo about 4 years ago

http://openqa.suse.de/tests/3982686#step/reboot_gnome/18 shows the "Stall detection" issue on ppc64le as well. It seems this issue is not limited to aarch64, and it seems to happen often during reboot.

Actions #12

Updated by JERiveraMoya about 4 years ago

I can see another stall for the same test suite on aarch64 in a different module: https://openqa.suse.de/tests/3981340#step/shutdown/4
This problem is found on workers 1, 2 and 3. You can re-open this bug in case we really see a crash there.

Actions #13

Updated by riafarov about 4 years ago

  • Subject changed from [functional][y][sporadic] "Stall was detected" in accept license, only with full medium to [functional][y][sporadic][fast] "Stall was detected" in accept license, only with full medium
  • Due date set to 2020-03-24
  • Assignee set to riafarov

zluo wrote:

http://openqa.suse.de/tests/3982686#step/reboot_gnome/18 shows the "Stall detection" issue on ppc64le as well. It seems this issue is not limited to aarch64, and it seems to happen often during reboot.

Let's not mix these two issues here. I will disable Y2DEBUG=1 and self-update on arm to improve stability there. Let's see if that helps.

Actions #14

Updated by riafarov about 4 years ago

So disabling Y2DEBUG and self-update didn't help.

Actions #15

Updated by riafarov about 4 years ago

  • Status changed from New to Feedback

PR created: https://github.com/os-autoinst/os-autoinst-distri-opensuse/pull/9825
With a soft-failure we can track whether the workaround is still required and revert the change once the issue is resolved.
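The general soft-failure idea can be sketched as below. This does not reproduce the change in the PR; it is a generic illustration in Python, and every name in it (`run_with_soft_failure`, `is_known_issue`, the `soft_failures` list, the reference string) is hypothetical. The point is that a known issue is worked around but still leaves a visible, ticket-referenced record, so the workaround can be removed once no more records show up.

```python
def run_with_soft_failure(step, workaround, is_known_issue, soft_failures, reference):
    """Run step(); on a known issue, record a soft failure and run workaround().

    All parameters are hypothetical helpers for this sketch:
    - step / workaround: callables taking no arguments
    - is_known_issue: callable deciding whether an exception matches the ticket
    - soft_failures: list collecting the recorded soft-failure messages
    - reference: ticket reference stored with each record
    """
    try:
        return step()
    except Exception as err:
        if not is_known_issue(err):
            # Unknown problems must still fail hard.
            raise
        # Known issue: keep a trace so the workaround stays visible in results.
        soft_failures.append(f"{reference}: {err}")
        return workaround()


# Usage example with made-up callables.
def stalled_step():
    raise RuntimeError("Stall was detected")


recorded = []
result = run_with_soft_failure(
    step=stalled_step,
    workaround=lambda: "continued after workaround",
    is_known_issue=lambda err: "Stall was detected" in str(err),
    soft_failures=recorded,
    reference="poo#63724",
)
```

Once recent builds no longer add entries to the list, the workaround is a candidate for being reverted.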

Actions #16

Updated by riafarov about 4 years ago

  • Status changed from Feedback to Resolved

No failures in the recent builds.

Actions #17

Updated by leli about 4 years ago

Hi Rodion, do you think this is the same issue? https://openqa.nue.suse.com/tests/4081755#step/first_boot/10

Actions #18

Updated by riafarov about 4 years ago

leli wrote:

Hi Rodion, do you think this is the same issue? https://openqa.nue.suse.com/tests/4081755#step/first_boot/10

Sorry for the late reply. As per our discussion on RC, it's a different issue: qemu throws a "Guest display disabled" message, so it might be caused by the use of a different video device than the image we use for the upgrade.
