coordination #101048

open

[epic] Investigate and fix higher instability of openqaworker-arm-4/5 vs. arm-1/2/3

Added by okurz over 2 years ago. Updated 11 months ago.

Status: New
Priority: Normal
Assignee: -
Category: Regressions/Crashes
Target version:
Start date: 2021-10-15
Due date:
% Done: 82%
Estimated time: (Total: 0.00 h)

Description

Observation

According to https://monitor.qa.suse.de/d/nRDab3Jiz/openqa-jobs-test?viewPanel=27&orgId=1&from=now-30d&to=now (sort by "avg" in the table on the right-hand side), openqaworker-arm-4/5 have a fail-ratio of 33-36%, whereas openqaworker-arm-1/2/3 have a fail-ratio of 15-17%.

Acceptance criteria

  • AC1: openqaworker-arm-4/5 have a fail-ratio less than or equal to that of arm-1/2/3
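
A minimal sketch of how AC1 could be checked once per-host job counts are available. It assumes the fail-ratio of a group is the total number of failed jobs divided by the total number of jobs in that group; the ticket does not say whether hosts should be compared individually or as a group. The helper names `group_fail_ratio` and `ac1_met` are hypothetical, and the job counts are placeholders chosen only to resemble the percentages from the observation, not real measurements.

```python
from typing import Dict, Tuple

# Maps host name -> (failed jobs, total jobs).
Counts = Dict[str, Tuple[int, int]]


def group_fail_ratio(counts: Counts) -> float:
    """Return failed/total over all hosts in the group (0.0 if no jobs ran)."""
    failed = sum(f for f, _ in counts.values())
    total = sum(t for _, t in counts.values())
    return failed / total if total else 0.0


def ac1_met(newer: Counts, older: Counts) -> bool:
    """AC1 (as assumed here): the newer group fails no more often than the older one."""
    return group_fail_ratio(newer) <= group_fail_ratio(older)


# Placeholder numbers for illustration only.
older = {
    "openqaworker-arm-1": (15, 100),
    "openqaworker-arm-2": (16, 100),
    "openqaworker-arm-3": (17, 100),
}
newer = {
    "openqaworker-arm-4": (33, 100),
    "openqaworker-arm-5": (36, 100),
}

print(f"arm-1/2/3: {group_fail_ratio(older):.1%}")
print(f"arm-4/5:   {group_fail_ratio(newer):.1%}")
print("AC1 met:", ac1_met(newer, older))
```

With these placeholder counts the check reports that AC1 is not met, which matches the situation described in the observation.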

Additional information and ideas from the hardware comparison between arm-1/2/3 and arm-4/5

  • The CPU of arm-4/5 (specific model and version: Cavium ThunderX2) is known to behave badly for our use case. This is what distinguishes arm-4/5 from the older arm workers, which have the previous version of that CPU model installed.
  • Disabling CPU control and CPU frequency scaling in the firmware environment didn't make a difference.
    • Before that, we had already reduced the number of worker slots considerably, and that didn't help either.
    • There are still a few ideas to consider (see #109232#note-5).
    • There are also more variables in the firmware environment (see #109232#note-20) we can play with.
  • Next time we should buy different hardware (see private comment #109232#note-11).
  • See the full ticket #109232 for more context about these findings.

Suggestions

  • Confirm whether typing issues cause the failures (look for timeouts, observe additional or missing characters in typed commands)
  • Upgrade arm3 to Leap 15.3 and compare failure rate -> #101265 => Leap 15.3 behaves similarly to Leap 15.2
  • Consider switching to kernel-stable or kernel-head -> #101271 => "kernel-default" from Kernel:stable behaves the same as the openSUSE:Leap:15.3 one
  • Consider downgrading kernel to what's used in 15.2 -> same upstream version is running on most
  • Bring back arm 4 and 5 after verifying stability
  • Run typing.pm from os-autoinst as test in production -> #101262
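
The first suggestion (looking for additional or missing characters in typed commands) could be partly automated. The following is only a sketch: it assumes the command that was meant to be typed and the text that actually arrived in the SUT are both available as strings, which this ticket does not confirm. `typing_diff` is a hypothetical helper, not part of os-autoinst, and it relies on Python's standard `difflib` module.

```python
from difflib import SequenceMatcher
from typing import Tuple


def typing_diff(expected: str, observed: str) -> Tuple[str, str]:
    """Return (missing, additional) characters of `observed` relative to `expected`."""
    missing, additional = [], []
    matcher = SequenceMatcher(None, expected, observed, autojunk=False)
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag in ("delete", "replace"):
            # Characters that were expected but never arrived.
            missing.append(expected[i1:i2])
        if tag in ("insert", "replace"):
            # Characters that arrived although they were not typed (e.g. key repeat).
            additional.append(observed[j1:j2])
    return "".join(missing), "".join(additional)


# Illustrative strings only, not taken from real job logs.
for observed in ("zypper", "zyppper", "zyper"):
    missing, additional = typing_diff("zypper", observed)
    print(f"{observed!r}: missing={missing!r} additional={additional!r}")
```

A repeated character would point at key-repeat problems (compare #110542), while missing characters would point at dropped key events.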

Files

sysctl_diff.html (39.3 KB): arm4 left, arm3 right (nicksinger, 2021-10-18 11:36)
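
How the attached diff was produced is not stated. As a rough sketch, a comparable comparison could be made from the output of `sysctl -a` captured on each host, which prints one `key = value` pair per line. The helper names below are hypothetical and the sample settings are placeholders, not values taken from arm3 or arm4.

```python
from typing import Dict, Optional, Tuple


def parse_sysctl(text: str) -> Dict[str, str]:
    """Parse `sysctl -a` style output ("key = value" per line) into a dict."""
    settings = {}
    for line in text.splitlines():
        key, sep, value = line.partition("=")
        if sep:  # skip lines without "=" (e.g. blank lines or error messages)
            settings[key.strip()] = value.strip()
    return settings


def sysctl_differences(
    left: str, right: str
) -> Dict[str, Tuple[Optional[str], Optional[str]]]:
    """Return {key: (left value, right value)} for keys whose values differ.

    A value of None means the key is absent on that side.
    """
    a, b = parse_sysctl(left), parse_sysctl(right)
    return {
        key: (a.get(key), b.get(key))
        for key in sorted(a.keys() | b.keys())
        if a.get(key) != b.get(key)
    }


# Placeholder output for illustration only.
arm4 = "vm.swappiness = 60\nkernel.pid_max = 4194304\n"
arm3 = "vm.swappiness = 60\nkernel.pid_max = 32768\nvm.dirty_ratio = 20\n"

for key, (left_value, right_value) in sysctl_differences(arm4, arm3).items():
    print(f"{key}: arm4={left_value} arm3={right_value}")
```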

Subtasks 11 (2 open, 9 closed)

  • action #101262: Document running os-autoinst full-stack.t on OSD workers size:M (Resolved, okurz, 2021-10-21)
  • action #101265: Upgrade arm3 to Leap 15.3 and compare failure rate size:M (Resolved, mkittler, 2021-10-15)
  • openQA Infrastructure - action #101271: Try Kernel:stable on arm4+arm5 and compare failure rate size:M (Resolved, kraih, 2021-10-15)
  • openQA Infrastructure - action #104304: Crosscheck results of https://github.com/os-autoinst/os-autoinst#verifying-a-runtime-environment on arm-1/2/3 vs. arm-4/5 to find out if arm-4/5 are "typing stable" size:M (Resolved, mkittler, 2021-12-22)
  • action #109232: Document relevant differences of arm-4/5 vs. arm-1/2/3 and aarch64.o.o, involve domain experts in asking what parameters are important to be able to run openQA tests size:M (Resolved, mkittler, 2022-03-30)
  • openQA Infrastructure - action #109494: Restore network connection of arm-4/5 size:M (Resolved, nicksinger, 2022-04-05)
  • openQA Infrastructure - action #110539: Ask OBS team if they would like to swap ARM workers with us (Resolved, okurz, 2022-05-02)
  • action #110542: Try to mitigate "VNC typing issues" with disabled key repeat (Resolved, okurz, 2022-05-02)
  • openQA Infrastructure - action #110545: Investigate and fix higher instability of openqaworker-arm-4/5 vs. arm-1/2/3 - further things to try size:M (Workable, 2022-05-02)
  • openQA Infrastructure - action #111578: Recover openqaworker-arm-4/5 after "bricking" in #110545 size:M (Resolved, nicksinger)
  • action #113441: Try to mitigate "VNC typing issues" with disabled key repeat in linux tty's of qemu tests (New)

Related issues 1 (0 open, 1 closed)

Related to openQA Project - action #101030: Typing problems on aarch64 (Resolved, okurz, 2021-10-15)