action #60833

[qe-core][sle][functional] performance issue of aarch64 worker: Stall detected

Added by zluo almost 2 years ago. Updated 7 months ago.

Status: Rejected
Priority: High
Assignee:
Category: Infrastructure
Target version: -
Start date: 2019-12-10
Due date:
% Done: 0%
Estimated time:
Difficulty:

Description

See the related issue reported in https://progress.opensuse.org/issues/56087

We have statistics that clearly show a performance issue on openqaworker-arm-1 and openqaworker-arm-2:

https://openqa.suse.de/tests/latest?arch=aarch64&distri=sle&flavor=Online&machine=aarch64&test=zluo-poo56087&version=15-SP2#next_previous

We should reduce the number of workers on these two machines.


Related issues

Related to openQA Tests - action #46190: [functional][u] test fails in user_settings - mistyping in Username (lowercase instead of uppercase) (Resolved, 2019-01-15)

Related to openQA Tests - action #25864: [tools][functional][u] stall detected in openqaworker-arm-1 through 3 sometimes - "worker performance issues" (Resolved, 2017-10-09)

Blocked by openQA Infrastructure - action #41882: all arm worker die after some time (Resolved, 2018-10-02)

History

#1 Updated by zluo almost 2 years ago

  • Related to action #46190: [functional][u] test fails in user_settings - mistyping in Username (lowercase instead of uppercase) added

#2 Updated by okurz almost 2 years ago

This is an autogenerated message for openQA integration by the openqa_review script:

This bug is still referenced in a failing openQA test: extra_tests_on_gnome
https://openqa.suse.de/tests/3731488

To prevent further reminder comments one of the following options should be followed:

  1. The test scenario is fixed by applying the bug fix to the tested product or the test is adjusted
  2. The openQA job group is moved to "Released"
  3. The label in the openQA scenario is removed

#3 Updated by okurz over 1 year ago

This is an autogenerated message for openQA integration by the openqa_review script:

This bug is still referenced in a failing openQA test: gnome+proxy_SCC+allmodules
https://openqa.suse.de/tests/3758637

To prevent further reminder comments one of the following options should be followed:

  1. The test scenario is fixed by applying the bug fix to the tested product or the test is adjusted
  2. The openQA job group is moved to "Released"
  3. The label in the openQA scenario is removed

#4 Updated by okurz over 1 year ago

  • Category set to Bugs in existing tests

#5 Updated by SLindoMansilla over 1 year ago

The machines are again using the old number of workers, which is known to produce typing issues: https://gitlab.suse.de/openqa/salt-pillars-openqa/blob/master/openqa/workerconf.sls#L489

This commit https://gitlab.suse.de/openqa/salt-pillars-openqa/commit/cef2ca2755860394d0ace4178ef51cc800dc34fe suggests masking services, and the tools team agreed that masking should be used while investigating.
Once the right number of workers is known, it should be changed in the salt state https://gitlab.suse.de/openqa/salt-pillars-openqa/blob/master/openqa/workerconf.sls#L489

#6 Updated by SLindoMansilla over 1 year ago

  • Assignee set to SLindoMansilla
  • Priority changed from Normal to High

Performing binary search for openqaworker-arm-1.
Starting workers: 20
Trying with: 10 (workers from 11 to 20 stopped and masked)
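
The binary search described above could be sketched roughly as follows. This is only an illustration, not the procedure actually run on the machine: `find_max_stable_workers` and `is_stable` are hypothetical names, and the sketch assumes that stability is monotonic (if N workers run without stalls, any smaller number does too) and that the starting count of 20 is already known to stall.

```python
from typing import Callable, List, Tuple


def find_max_stable_workers(
    total: int, is_stable: Callable[[int], bool]
) -> Tuple[int, List[int]]:
    """Binary-search the largest worker count that shows no stalls.

    `is_stable(n)` is a hypothetical check (e.g. run test jobs with n
    worker instances enabled and look for "Stall detected").
    Returns the largest stable count and the list of counts probed.
    """
    lo = 0        # assumed stable: no workers, no stalls
    hi = total    # assumed unstable: the current, problematic count
    probes: List[int] = []
    while hi - lo > 1:
        mid = (lo + hi) // 2
        probes.append(mid)
        if is_stable(mid):
            lo = mid  # mid works, try more workers
        else:
            hi = mid  # mid stalls, try fewer workers
    return lo, probes


if __name__ == "__main__":
    # Made-up example: pretend stalls start above 6 workers.
    best, tried = find_max_stable_workers(20, lambda n: n <= 6)
    print(best, tried)
```

With 20 starting workers the first probe is 10, matching the first step noted above; each further probe halves the remaining range.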

#7 Updated by okurz over 1 year ago

I learned that masked services make the salt state apply fail, so I reverted my masking and updated the salt pillars accordingly. If you want to experiment with "less workers", I suggest pinning test jobs to openqaworker-arm-3, which is reduced to 4 parallel worker instances for now. We can run the experiment, but we will need to unmask worker instances again as soon as we have problems with salt recipe application.

#8 Updated by SLindoMansilla over 1 year ago

  • Assignee deleted (SLindoMansilla)

The approach is not accepted by the tools team.
To be decided in the next refinement meeting.

#9 Updated by okurz over 1 year ago

The approach is accepted if you use salt pillar changes rather than simply masking systemd services, so that salt does not break.

#10 Updated by zluo over 1 year ago

#25864 is actually an old ticket which has been worked on by okurz.

#11 Updated by SLindoMansilla over 1 year ago

  • Assignee set to mgriessmeier
  1. Increase QEMURAM for openqaworker-arm-1 and openqaworker-arm-3.
  2. Ask Santi about requirements for an ARM server for test environment (openqa.suse.de).
  3. Show requirements to Ralf and see if it is possible to acquire such hardware.

#12 Updated by okurz over 1 year ago

SLindoMansilla wrote:

  1. Increase QEMURAM for openqaworker-arm-1 and openqaworker-arm-3.

Yes. This would also go in line with #46190#note-88

  2. Ask Santi about requirements for an ARM server for test environment (openqa.suse.de).
  3. Show requirements to Ralf and see if it is possible to acquire such hardware.

New ARM hardware was already requested, see https://trello.com/c/JQtnALhz/6-openqa-hw-budget-planning#comment-5e185a3e9a5c3786c32fd089

#13 Updated by okurz over 1 year ago

  • Related to action #25864: [tools][functional][u] stall detected in openqaworker-arm-1 through 3 sometimes - "worker performance issues" added

#15 Updated by SLindoMansilla over 1 year ago

  • Parent task deleted (#56087)

#16 Updated by SLindoMansilla over 1 year ago

  • Status changed from New to Blocked
  • Assignee changed from mgriessmeier to szarate

#17 Updated by SLindoMansilla over 1 year ago

  • Blocked by action #41882: all arm worker die after some time added

#18 Updated by okurz over 1 year ago

  • Status changed from Blocked to Workable

@SLindoMansilla #41882 is about machines crashing completely, not about performance issues per se. Please do not use that as blocker. If there is something specific I could help you with I am happy to help.

#19 Updated by tjyrinki_suse 11 months ago

  • Subject changed from [sle][functional][u] performance issue of aarch64 worker: Stall detected to [qe-core][sle][functional] performance issue of aarch64 worker: Stall detected

#20 Updated by szarate 7 months ago

  • Category changed from Bugs in existing tests to Infrastructure
  • Assignee deleted (szarate)

I will not be taking a look at stalls for now...

#21 Updated by szarate 7 months ago

  • Status changed from Workable to Rejected
  • Assignee set to szarate

I don't see it referenced anymore, and stalls + aarch64 is usually a bad combination on Caviums
