action #162602

open

openQA Project - coordination #112862: [saga][epic] Future ideas for easy multi-machine handling: MM-tests as first-class citizens

openQA Project - coordination #111929: [epic] Stable multi-machine tests covering multiple physical workers

[FIRING:1] worker40 (worker40: CPU load alert openQA worker40 salt cpu_load_alert_worker40 worker) size:S

Added by okurz 28 days ago. Updated 11 days ago.

Status: Blocked
Priority: Normal
Assignee:
Category: Regressions/Crashes
Target version:
Start date: 2024-06-20
Due date:
% Done: 0%
Estimated time:

Description

Observation

With #162374, w40 (worker40.oqa.prg2.suse.org) is the only OSD PRG2 x86_64 tap worker, and due to the openQA job queue size w40 is executing openQA jobs near-continuously. Now one alert triggered about too high CPU load and another about a partition getting full. Similar to #162596.

Suggestions

  • Maybe the high CPU load was caused by the lack of space - which is tracked in #162596
  • Are tests passing successfully on worker40? - If it doesn't look like we have typing or similar issues, bump the alert threshold.
  • Lower the load limit
  • Check the number of worker slots and e.g. reduce according to the load - maybe we didn't notice the capacity was already too high before
  • Take #162596 into account

Rollback actions


Related issues 3 (1 open, 2 closed)

Related to openQA Infrastructure - action #162719: Ensure w40 has more space for worker pool directories size:S (Resolved, dheidler, 2024-06-21)

Copied from openQA Infrastructure - action #162596: [FIRING:1] worker40 (worker40: partitions usage (%) alert openQA partitions_usage_alert_worker40 worker) auto_review:"No space left on device":retry (Blocked, okurz)

Copied to openQA Infrastructure - action #162605: [FIRING:1] CPU load alert, should be "system load" (Resolved, okurz, 2024-06-20)

Actions #1

Updated by okurz 28 days ago

  • Copied from action #162596: [FIRING:1] worker40 (worker40: partitions usage (%) alert openQA partitions_usage_alert_worker40 worker) auto_review:"No space left on device":retry added
Actions #2

Updated by okurz 28 days ago

  • Copied to action #162605: [FIRING:1] CPU load alert, should be "system load" added
Actions #3

Updated by livdywan 27 days ago

  • Subject changed from [FIRING:1] worker40 (worker40: CPU load alert openQA worker40 salt cpu_load_alert_worker40 worker) to [FIRING:1] worker40 (worker40: CPU load alert openQA worker40 salt cpu_load_alert_worker40 worker) size:S
  • Description updated (diff)
  • Status changed from New to Workable
Actions #4

Updated by okurz 27 days ago

  • Status changed from Workable to In Progress
  • Assignee set to okurz
Actions #7

Updated by livdywan 26 days ago

okurz wrote in #note-6:

https://gitlab.suse.de/openqa/salt-pillars-openqa/-/merge_requests/845 merged

So what's the next step? Monitor CPU load for the afternoon?

Actions #8

Updated by okurz 26 days ago

  • Status changed from In Progress to Resolved

The limit is effective and the rollback action is done. No alert is firing right now. I will be notified if the alert still triggers.

Actions #9

Updated by okurz 25 days ago

  • Status changed from Resolved to In Progress

Apparently that was not enough. I need to silence the alerts and see what to do about it.

Actions #10

Updated by openqa_review 25 days ago

  • Due date set to 2024-07-07

Setting due date based on mean cycle time of SUSE QE Tools

Actions #11

Updated by okurz 23 days ago

  • Priority changed from Urgent to High

Silenced alert again for now.

I looked at https://monitor.qa.suse.de/d/WDworker40/worker-dashboard-worker40?orgId=1&from=now-7d&to=now
and found short load spikes coinciding with significant but non-critical memory usage, low CPU usage, and high I/O usage with I/O times maxing out. As observed over the past days, during those times particularly demanding openQA jobs with larger HDDSIZE requests and correspondingly high I/O demands seem to be running. I assume that we are not actually hitting typing issues in such cases but that we still stall test execution and trigger the observed alerts.
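
On Linux, the load average also counts tasks blocked in uninterruptible I/O wait, which is why load can spike while CPU usage stays low. A minimal sketch of how one could confirm this on the host, assuming two samples of the aggregate "cpu" line from /proc/stat taken some seconds apart (the helper names and the sample values below are made up for illustration and are not taken from worker40):

```python
def cpu_fields(stat_line):
    """Parse the first eight counters of a /proc/stat "cpu" line.

    Order: user, nice, system, idle, iowait, irq, softirq, steal.
    """
    parts = stat_line.split()
    return [int(value) for value in parts[1:9]]


def iowait_share(before, after):
    """Fraction of CPU time spent in iowait between two samples."""
    old, new = cpu_fields(before), cpu_fields(after)
    delta = [n - o for o, n in zip(old, new)]
    total = sum(delta)
    return delta[4] / total if total else 0.0


# Hypothetical samples, not real measurements
sample_before = "cpu 1000 0 500 8000 500 0 0 0 0 0"
sample_after = "cpu 1100 0 550 8200 1150 0 0 0 0 0"
print(f"iowait share: {iowait_share(sample_before, sample_after):.2f}")
```

A high iowait share together with a low user/system share would support the assumption that the spikes are I/O-bound rather than CPU-bound.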

So I proposed
https://gitlab.suse.de/openqa/salt-pillars-openqa/-/merge_requests/848
to reduce the worker load limit from 30 to 25.
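
To illustrate what lowering the limit means in practice, here is a minimal sketch, assuming the worker simply compares the 1-minute load average against the configured limit before picking up a new job. The function name and the comparison rule are hypothetical; the actual openQA worker logic may weigh the load values differently.

```python
import os


def may_accept_job(load_limit, load_avg=None):
    """Return True if a (hypothetical) worker should pick up a new job.

    load_avg is a (1 min, 5 min, 15 min) tuple; it defaults to the
    current system values from os.getloadavg().
    """
    if load_avg is None:
        load_avg = os.getloadavg()
    return load_avg[0] < load_limit


# Made-up load values for a spike, not measured on worker40
spike = (27.3, 18.0, 12.5)
print(may_accept_job(30, spike))  # old limit: still accepts jobs
print(may_accept_job(25, spike))  # new limit: holds back
```

With the lower limit the worker would stop taking new jobs earlier during such a spike, at the cost of some throughput.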

Next idea: Restrict "qemu_x86_64-large-mem" to a limited number of worker instances, or introduce a new class and ask the SAP-HA squad to schedule tests against those in particular.

Actions #14

Updated by okurz 16 days ago

  • Status changed from In Progress to Feedback
Actions #15

Updated by okurz 11 days ago

  • Related to action #162719: Ensure w40 has more space for worker pool directories size:S added
Actions #16

Updated by okurz 11 days ago

  • Due date deleted (2024-07-07)
  • Status changed from Feedback to Blocked
  • Priority changed from High to Normal

The proposal for "size" classes stands, and I would like to give it more time to decide whether that is the right approach. There is also the impact of space depletion, which coincided with the too high load, so I am blocking this on #162719.
