action #60833

closed

[qe-core][sle][functional] performance issue of aarch64 worker: Stall detected

Added by zluo about 5 years ago. Updated almost 4 years ago.

Status: Rejected
Priority: High
Assignee:
Category: Infrastructure
Target version: -
Start date: 2019-12-10
Due date:
% Done: 0%
Estimated time:
Difficulty:

Description

see related issue reported in https://progress.opensuse.org/issues/56087

We have statistics that clearly show a performance issue on openqaworker-arm-1 and openqaworker-arm-2:

https://openqa.suse.de/tests/latest?arch=aarch64&distri=sle&flavor=Online&machine=aarch64&test=zluo-poo56087&version=15-SP2#next_previous

We should reduce the number of workers on these two machines.


Related issues 3 (0 open, 3 closed)

Related to openQA Tests (public) - action #46190: [functional][u] test fails in user_settings - mistyping in Username (lowercase instead of uppercase) (Resolved, SLindoMansilla, 2019-01-15)

Related to openQA Tests (public) - action #25864: [tools][functional][u] stall detected in openqaworker-arm-1 through 3 sometimes - "worker performance issues" (Resolved, okurz, 2017-10-09)

Blocked by openQA Infrastructure (public) - action #41882: all arm worker die after some time (Resolved, okurz, 2018-10-02)

Actions #1

Updated by zluo about 5 years ago

  • Related to action #46190: [functional][u] test fails in user_settings - mistyping in Username (lowercase instead of uppercase) added
Actions #2

Updated by okurz almost 5 years ago

This is an autogenerated message for openQA integration by the openqa_review script:

This bug is still referenced in a failing openQA test: extra_tests_on_gnome
https://openqa.suse.de/tests/3731488

To prevent further reminder comments one of the following options should be followed:

  1. The test scenario is fixed by applying the bug fix to the tested product or the test is adjusted
  2. The openQA job group is moved to "Released"
  3. The label in the openQA scenario is removed
Actions #3

Updated by okurz almost 5 years ago

This is an autogenerated message for openQA integration by the openqa_review script:

This bug is still referenced in a failing openQA test: gnome+proxy_SCC+allmodules
https://openqa.suse.de/tests/3758637

To prevent further reminder comments one of the following options should be followed:

  1. The test scenario is fixed by applying the bug fix to the tested product or the test is adjusted
  2. The openQA job group is moved to "Released"
  3. The label in the openQA scenario is removed
Actions #4

Updated by okurz almost 5 years ago

  • Category set to Bugs in existing tests
Actions #5

Updated by SLindoMansilla almost 5 years ago

The worker hosts are again running the old number of workers, which is known to produce typing issues: https://gitlab.suse.de/openqa/salt-pillars-openqa/blob/master/openqa/workerconf.sls#L489

This commit https://gitlab.suse.de/openqa/salt-pillars-openqa/commit/cef2ca2755860394d0ace4178ef51cc800dc34fe suggests masking the services, and the tools team agreed that masking should be used while investigating.
Once the right number of workers is known, it should be changed in the salt state https://gitlab.suse.de/openqa/salt-pillars-openqa/blob/master/openqa/workerconf.sls#L489

Actions #6

Updated by SLindoMansilla almost 5 years ago

  • Assignee set to SLindoMansilla
  • Priority changed from Normal to High

Performing binary search for openqaworker-arm-1.
Starting workers: 20
Trying with: 10 (workers from 11 to 20 stopped and masked)
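
The binary search described above could be sketched as follows. This is only a minimal illustration, not the procedure actually run on openqaworker-arm-1: `is_stable` is a hypothetical callback standing in for "run test jobs with n worker instances and check whether stalls or typing issues appear", and the sketch assumes that if n instances are stable, every smaller count is stable too.

```python
from typing import Callable, List


def max_stable_instances(low: int, high: int,
                         is_stable: Callable[[int], bool]) -> int:
    """Return the largest n in [low, high] for which is_stable(n) is True.

    Assumes monotonic behaviour: if n instances are stable, so is any
    smaller count. Returns low - 1 if even `low` instances are unstable.
    """
    best = low - 1
    while low <= high:
        mid = (low + high) // 2
        if is_stable(mid):
            best = mid       # mid works, try a higher count
            low = mid + 1
        else:
            high = mid - 1   # mid fails, try a lower count
    return best


def instances_to_stop(total: int, keep: int) -> List[int]:
    """Instance numbers to stop when only the first `keep` of `total` stay up."""
    return list(range(keep + 1, total + 1))
```

With 20 instances, the first probe of `max_stable_instances(1, 20, ...)` is 10, which matches the first step above, and `instances_to_stop(20, 10)` yields instances 11 to 20.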

Actions #7

Updated by okurz almost 5 years ago

I learned that masked services make the salt state apply fail, so I reverted my masking and updated the salt pillars accordingly. If you want to experiment with "less workers" I suggest pinning test jobs to openqaworker-arm-3, which is reduced to 4 worker instances in parallel for now. We can run the experiment, but we will need to unmask worker instances again as soon as we have problems with salt recipe application.

Actions #8

Updated by SLindoMansilla almost 5 years ago

  • Assignee deleted (SLindoMansilla)

Approach is not accepted by tools team.
To decide in next refinement meeting.

Actions #9

Updated by okurz almost 5 years ago

The approach is accepted if you use salt pillar changes instead of simply masking systemd services, so that salt does not break.

Actions #10

Updated by zluo almost 5 years ago

#25864 is actually an old ticket that okurz has worked on.

Actions #11

Updated by SLindoMansilla almost 5 years ago

  • Assignee set to mgriessmeier
  1. Increase QEMURAM for openqaworker-arm-1 and openqaworker-arm-3.
  2. Ask Santi about requirements for an ARM server for test environment (openqa.suse.de).
  3. Show requirements to Ralf and see if it is possible to acquire such hardware.
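
As a rough illustration of how step 1 interacts with the number of worker instances: QEMURAM sets the guest memory per job in MiB, so raising it lowers how many instances fit into a host's RAM at the same time. The sketch below is an assumption-laden estimate only. The function name and all figures are hypothetical, the real host memory and QEMURAM values are not stated in this ticket, and per-process qemu overhead is ignored.

```python
def max_instances_for_ram(host_ram_mib: int, qemuram_mib: int,
                          reserved_mib: int = 0) -> int:
    """Upper bound on parallel worker instances that fit into host RAM.

    host_ram_mib: total host memory in MiB (hypothetical input)
    qemuram_mib:  guest memory per job in MiB, i.e. the QEMURAM setting
    reserved_mib: memory kept back for the host itself (hypothetical input)
    """
    if qemuram_mib <= 0:
        raise ValueError("qemuram_mib must be positive")
    usable = host_ram_mib - reserved_mib
    return max(0, usable // qemuram_mib)
```

For example, doubling `qemuram_mib` for the same host halves the result, so a QEMURAM increase and a reduced instance count would need to be planned together.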
Actions #12

Updated by okurz almost 5 years ago

SLindoMansilla wrote:

  1. Increase QEMURAM for openqaworker-arm-1 and openqaworker-arm-3.

Yes. This would also be in line with #46190#note-88

  2. Ask Santi about requirements for an ARM server for test environment (openqa.suse.de).
  3. Show requirements to Ralf and see if it is possible to acquire such hardware.

New ARM hardware was already requested, see https://trello.com/c/JQtnALhz/6-openqa-hw-budget-planning#comment-5e185a3e9a5c3786c32fd089

Actions #13

Updated by okurz almost 5 years ago

  • Related to action #25864: [tools][functional][u] stall detected in openqaworker-arm-1 through 3 sometimes - "worker performance issues" added
Actions #15

Updated by SLindoMansilla over 4 years ago

  • Parent task deleted (#56087)
Actions #16

Updated by SLindoMansilla over 4 years ago

  • Status changed from New to Blocked
  • Assignee changed from mgriessmeier to szarate
Actions #17

Updated by SLindoMansilla over 4 years ago

  • Blocked by action #41882: all arm worker die after some time added
Actions #18

Updated by okurz over 4 years ago

  • Status changed from Blocked to Workable

@SLindoMansilla #41882 is about machines crashing completely, not about performance issues per se. Please do not use that as blocker. If there is something specific I could help you with I am happy to help.

Actions #19

Updated by tjyrinki_suse about 4 years ago

  • Subject changed from [sle][functional][u] performance issue of aarch64 worker: Stall detected to [qe-core][sle][functional] performance issue of aarch64 worker: Stall detected
Actions #20

Updated by szarate almost 4 years ago

  • Category changed from Bugs in existing tests to Infrastructure
  • Assignee deleted (szarate)

I will not be looking at stalls for now...

Actions #21

Updated by szarate almost 4 years ago

  • Status changed from Workable to Rejected
  • Assignee set to szarate

I don't see it referenced anymore, and stalls + aarch64 is usually a bad combination on Caviums
