coordination #101048

[epic] Investigate and fix higher instability of openqaworker-arm-4/5 vs. arm-1/2/3

Added by okurz about 2 months ago. Updated 4 days ago.

Status: Workable
Priority: Low
Assignee: -
Category: Concrete Bugs
Target version:
Start date: 2021-10-15
Due date:
% Done: 67%
Estimated time: (Total: 0.00 h)
Difficulty:

Description

Observation

According to https://monitor.qa.suse.de/d/nRDab3Jiz/openqa-jobs-test?viewPanel=27&orgId=1&from=now-30d&to=now (sort by "avg" in the table on the right-hand side), openqaworker-arm-4/5 have a fail-ratio of 33-36%, compared to 15-17% for openqaworker-arm-1/2/3.

Acceptance criteria

  • AC1: openqaworker-arm-4/5 have a fail-ratio less than or equal to that of arm-1/2/3
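
AC1 could be checked mechanically once per-worker job counts are available. The following is only a sketch: it assumes AC1 is read as "the worst of arm-4/5 is no worse than the worst of arm-1/2/3", and the function names and the `(failed, total)` input shape are made up for illustration, not taken from the dashboard.

```python
def fail_ratio(failed: int, total: int) -> float:
    """Return the fraction of failed jobs, or 0.0 when no jobs ran."""
    return failed / total if total else 0.0


def meets_ac1(new_workers: dict, old_workers: dict) -> bool:
    """Compare the worst fail-ratio of the new workers against the old ones.

    Both arguments map a worker name to a (failed, total) tuple.
    Assumption: AC1 is satisfied when the worst new worker is no worse
    than the worst old worker.
    """
    worst_new = max(fail_ratio(*counts) for counts in new_workers.values())
    worst_old = max(fail_ratio(*counts) for counts in old_workers.values())
    return worst_new <= worst_old
```

With counts matching the ratios from the observation (for example 33 and 36 failures out of 100 jobs on arm-4/5 against 15 to 17 out of 100 on arm-1/2/3), `meets_ac1` returns `False`.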

Suggestions

  • Confirm if typing issues cause the failures (look for timeouts, observe additional or missing characters in typed commands)
  • Upgrade arm3 to Leap 15.3 and compare failure rate -> #101265 => Leap 15.3 behaves similarly to Leap 15.2
  • Consider switching to kernel-stable or kernel-head -> #101271 => "kernel-default" from Kernel:stable behaves the same as the one from openSUSE:Leap:15.3
  • Consider downgrading kernel to what's used in 15.2 -> same upstream version is running on most
  • Bring back arm 4 and 5 after verifying stability
  • Run typing.pm from os-autoinst as test in production -> #101262

sysctl_diff.html (39.3 KB) - arm4 left, arm3 right - nicksinger, 2021-10-18 11:36
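
For the first suggestion (additional or missing characters in typed commands), a comparison of the expected command with what actually arrived could look roughly like this. It is a sketch using Python's standard `difflib`; the function name `typing_diff` is made up, and how the typed string is obtained from a failed job is left open.

```python
import difflib


def typing_diff(expected: str, typed: str) -> dict:
    """Classify characters as missing from or added to the typed string."""
    missing, extra = [], []
    matcher = difflib.SequenceMatcher(None, expected, typed, autojunk=False)
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        # "delete"/"replace": characters of the expected command that did not arrive
        if tag in ("delete", "replace"):
            missing.append(expected[i1:i2])
        # "insert"/"replace": characters that arrived but were not expected
        if tag in ("insert", "replace"):
            extra.append(typed[j1:j2])
    return {"missing": "".join(missing), "extra": "".join(extra)}
```

An empty result for both keys means the command was typed as intended, so a failure with an empty diff is probably not a typing issue.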

Subtasks

action #101262: Document running os-autoinst full-stack.t on OSD workers size:M (Workable)

action #101265: Upgrade arm3 to Leap 15.3 and compare failure rate size:M (Resolved, mkittler)

openQA Infrastructure - action #101271: Try Kernel:stable on arm4+arm5 and compare failure rate size:M (Resolved, kraih)


Related issues

Related to openQA Project - action #101030: Typing problems on aarch64 (Resolved, 2021-10-15)

Copied to openQA Project - action #101265: Upgrade arm3 to Leap 15.3 and compare failure rate size:M (Resolved, 2021-10-15)

History

#1 Updated by okurz about 1 month ago

#2 Updated by nicksinger about 1 month ago

A first investigation shows that we run Leap 15.2 on the "old" workers and Leap 15.3 on the new ones. The kernel versions differ quite a bit between the workers:
arm-1: 5.8.3-1.gbad027a-default
arm-2: 5.7.12-1.g9c98feb-default
arm-3: 5.3.18-lp152.95-default
arm-4: 5.3.18-59.27-default
arm-5: 5.3.18-59.27-default

Given that arm3 has at least a similar kernel version, I'd exclude the kernel for now.
The kernel cmdline is managed by salt and is therefore the same on all 5 machines.
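
To make the "similar kernel version" argument explicit, the release strings listed above can be grouped by their upstream version. This is just an illustrative sketch; the helper names are made up, and the only data used is the list from this comment.

```python
import re
from collections import defaultdict

# Kernel release strings as listed in this comment
KERNELS = {
    "arm-1": "5.8.3-1.gbad027a-default",
    "arm-2": "5.7.12-1.g9c98feb-default",
    "arm-3": "5.3.18-lp152.95-default",
    "arm-4": "5.3.18-59.27-default",
    "arm-5": "5.3.18-59.27-default",
}


def upstream_version(release: str) -> tuple:
    """Extract the leading major.minor.patch part of a kernel release string."""
    match = re.match(r"(\d+)\.(\d+)\.(\d+)", release)
    if match is None:
        raise ValueError(f"unexpected kernel release: {release!r}")
    return tuple(int(part) for part in match.groups())


def group_by_upstream(kernels: dict) -> dict:
    """Map each upstream version to the hosts running it."""
    groups = defaultdict(list)
    for host, release in kernels.items():
        groups[upstream_version(release)].append(host)
    return dict(groups)
```

Grouping `KERNELS` this way puts arm-3, arm-4 and arm-5 together under upstream version 5.3.18, although the distribution patch levels (lp152.95 vs. 59.27) still differ.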

I also tried to diff the sysctl values currently set on the systems. Due to the different kernels this is quite a tedious task, and I didn't see much that could make a difference here. Attaching the diff as HTML.
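
One way to make such a comparison less tedious is to look only at keys that exist on both machines and differ in value, since keys present on just one side mostly reflect the different kernels. A minimal sketch, assuming the input is the "key = value" output of `sysctl -a` saved per host (the function names are made up):

```python
def parse_sysctl(text: str) -> dict:
    """Parse `sysctl -a` style output ("key = value" per line) into a dict."""
    values = {}
    for line in text.splitlines():
        key, sep, value = line.partition("=")
        if sep:  # skip lines without a "key = value" shape
            values[key.strip()] = value.strip()
    return values


def differing_keys(left: dict, right: dict) -> dict:
    """Return keys present on both hosts whose values differ."""
    common = sorted(left.keys() & right.keys())
    return {key: (left[key], right[key]) for key in common if left[key] != right[key]}
```

Calling `differing_keys(parse_sysctl(arm4_text), parse_sysctl(arm3_text))` yields a mapping of key to (arm4 value, arm3 value), where `arm4_text` and `arm3_text` are hypothetical variables holding the saved output.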

#3 Updated by nicksinger about 1 month ago

Despite being way less utilized, arm4 sees load spikes of up to 75 and sits quite constantly at around 25 according to https://monitor.qa.suse.de/d/WDopenqaworker-arm-4/worker-dashboard-openqaworker-arm-4?viewPanel=54694&orgId=1&from=1633952613651&to=1634557413651. This could hint at IO performance issues.

#5 Updated by nicksinger about 1 month ago

Network IO seems fine. Higher packet drops can be observed on arm4 and arm5 (exactly the same pattern, which hints at the switch), but IMHO this shouldn't cause such a performance hit.
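
For cross-checking the drop counters directly on a worker, independent of the dashboard, the per-interface values can be read from `/proc/net/dev`. The sketch below assumes the usual Linux layout of that file (two header lines, then 8 receive and 8 transmit columns per interface); the function name is made up.

```python
def parse_drops(proc_net_dev: str) -> dict:
    """Extract rx/tx drop counters per interface from /proc/net/dev content."""
    drops = {}
    for line in proc_net_dev.splitlines()[2:]:  # skip the two header lines
        name, sep, rest = line.partition(":")
        fields = rest.split()
        if not sep or len(fields) < 16:
            continue
        # Receive columns: bytes packets errs drop ...; transmit columns start at index 8
        drops[name.strip()] = {"rx_drop": int(fields[3]), "tx_drop": int(fields[11])}
    return drops
```

On a worker this would be called as `parse_drops(open("/proc/net/dev").read())`. Sampling it twice and subtracting gives the drop rate over the interval.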

#6 Updated by okurz about 1 month ago

  • Priority changed from High to Urgent

#7 Updated by cdywan about 1 month ago

  • Description updated (diff)

#8 Updated by cdywan about 1 month ago

  • Tracker changed from action to coordination
  • Subject changed from Investigate higher instability of openqaworker-arm-4/5 vs. arm-1/2/3 to [epic] Investigate higher instability of openqaworker-arm-4/5 vs. arm-1/2/3
  • Description updated (diff)

#9 Updated by cdywan about 1 month ago

  • Copied to action #101265: Upgrade arm3 to Leap 15.3 and compare failure rate size:M added

#10 Updated by okurz about 1 month ago

  • Description updated (diff)

#11 Updated by okurz about 1 month ago

  • Description updated (diff)

#12 Updated by okurz about 1 month ago

  • Status changed from New to Blocked
  • Assignee set to okurz

blocked by subtasks

#13 Updated by okurz 16 days ago

  • Description updated (diff)

#14 Updated by okurz 16 days ago

  • Description updated (diff)

#15 Updated by okurz 16 days ago

  • Status changed from Blocked to New
  • Assignee deleted (okurz)

I updated the epic with the results from #101265 and #101271. We can now continue defining more hypotheses to follow up on.

#16 Updated by okurz 14 days ago

  • Subject changed from [epic] Investigate higher instability of openqaworker-arm-4/5 vs. arm-1/2/3 to [epic] Investigate and fix higher instability of openqaworker-arm-4/5 vs. arm-1/2/3

#17 Updated by kraih 13 days ago

There are still a lot of failed jobs from the #101271 stress test that should be searched for patterns. Maybe that will give some hints on where to look in follow-up investigations.
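
A first pass over those failed jobs could simply count which combinations of worker and failing test module occur most often. This is only a sketch: the record fields `worker`, `result` and `failed_module` are hypothetical and would need to be mapped to whatever the exported job data actually contains.

```python
from collections import Counter


def failure_patterns(jobs: list) -> list:
    """Count failed jobs per (worker, failed module) pair, most common first.

    Each job is a dict with the hypothetical keys "worker", "result"
    and "failed_module".
    """
    counts = Counter(
        (job["worker"], job["failed_module"])
        for job in jobs
        if job["result"] == "failed"
    )
    return counts.most_common()
```

If one module dominates the top of the list on arm-4/5 but not on arm-1/2/3, that would be a candidate for the next hypothesis.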

#18 Updated by okurz 12 days ago

  • Status changed from New to Workable
