action #162602


openQA Project - coordination #112862: [saga][epic] Future ideas for easy multi-machine handling: MM-tests as first-class citizens

openQA Project - coordination #111929: [epic] Stable multi-machine tests covering multiple physical workers

[FIRING:1] worker40 (worker40: CPU load alert openQA worker40 salt cpu_load_alert_worker40 worker) size:S

Added by okurz 10 days ago. Updated 2 days ago.

Status: In Progress
Priority: High
Assignee: okurz
Category: Regressions/Crashes
Target version:
Start date: 2024-06-20
Due date: 2024-07-07 (Due in 7 days)
% Done: 0%
Estimated time:

Description

Observation

With #162374, w40 (worker40.oqa.prg2.suse.org) is the only OSD PRG2 x86_64 tap worker, and due to the size of the openQA job queue w40 is executing openQA jobs near-continuously. Now an alert triggered about too-high CPU load, along with one about a partition getting full. Similar to #162596.

Suggestions

  • Maybe the high CPU load was caused by the lack of space, which is tracked in #162596
  • Are tests passing successfully on worker40? If it does not look like we have typing or similar issues, bump the alert threshold.
  • Lower the load limit
  • Check the number of worker slots and e.g. reduce it according to the load; maybe we did not notice that the capacity was already too high before
  • Take #162596 into account
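To illustrate the load-limit suggestions above, a minimal shell sketch of the comparison such an alert conceptually performs; the helper name and the threshold of 30 are illustrative (the 30 matches the pre-change limit mentioned later in this ticket, not the actual Grafana rule):

```shell
#!/bin/sh
# Compare the 15-minute load average against a threshold, as the
# cpu_load_alert conceptually does. awk handles the float comparison.
check_load() {
    # usage: check_load <load15> <threshold>; exits 0 if load exceeds it
    awk -v l="$1" -v t="$2" 'BEGIN { exit !(l > t) }'
}

load15=$(cut -d' ' -f3 /proc/loadavg)
if check_load "$load15" 30; then
    echo "load15=$load15 exceeds threshold 30"
else
    echo "load15=$load15 within threshold 30"
fi
```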

Rollback actions


Related issues (2 open, 0 closed)

Copied from openQA Infrastructure - action #162596: [FIRING:1] worker40 (worker40: partitions usage (%) alert openQA partitions_usage_alert_worker40 worker) auto_review:"No space left on device":retry (Feedback, okurz)

Copied to openQA Infrastructure - action #162605: [FIRING:1] CPU load alert, should be "system load" (Feedback, okurz, 2024-06-20 to 2024-07-05)

Actions #1

Updated by okurz 10 days ago

  • Copied from action #162596: [FIRING:1] worker40 (worker40: partitions usage (%) alert openQA partitions_usage_alert_worker40 worker) auto_review:"No space left on device":retry added
Actions #2

Updated by okurz 10 days ago

  • Copied to action #162605: [FIRING:1] CPU load alert, should be "system load" added
Actions #3

Updated by livdywan 10 days ago

  • Subject changed from [FIRING:1] worker40 (worker40: CPU load alert openQA worker40 salt cpu_load_alert_worker40 worker) to [FIRING:1] worker40 (worker40: CPU load alert openQA worker40 salt cpu_load_alert_worker40 worker) size:S
  • Description updated (diff)
  • Status changed from New to Workable
Actions #4

Updated by okurz 9 days ago

  • Status changed from Workable to In Progress
  • Assignee set to okurz
Actions #7

Updated by livdywan 9 days ago

okurz wrote in #note-6:

> https://gitlab.suse.de/openqa/salt-pillars-openqa/-/merge_requests/845 merged

So what's the next step? Monitor CPU load for the afternoon?

Actions #8

Updated by okurz 9 days ago

  • Status changed from In Progress to Resolved

The limit is effective and the rollback action is done. No alert is firing right now; we will be notified if the alert triggers again.

Actions #9

Updated by okurz 8 days ago

  • Status changed from Resolved to In Progress

Apparently that was not enough; I need to silence the alerts and figure out what else to do about it.

Actions #10

Updated by openqa_review 7 days ago

  • Due date set to 2024-07-07

Setting due date based on mean cycle time of SUSE QE Tools

Actions #11

Updated by okurz 6 days ago

  • Priority changed from Urgent to High

Silenced alert again for now.

I looked at https://monitor.qa.suse.de/d/WDworker40/worker-dashboard-worker40?orgId=1&from=now-7d&to=now
and found short load spikes coinciding with significant but non-critical memory usage, low CPU usage, and high I/O usage with maxed-out I/O times. As observed over the past days, during these periods there seem to be particularly demanding openQA jobs with bigger HDDSIZE requests and correspondingly high I/O demand. I assume that in such cases we are not actually hitting typing issues, but we still stall test execution and trigger the observed alerts.

So I proposed
https://gitlab.suse.de/openqa/salt-pillars-openqa/-/merge_requests/848
to reduce the worker load limit from 30 to 25.
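For reference, the load limit corresponds to the worker's critical load threshold in workers.ini; a minimal sketch of what the change amounts to (on OSD this value is deployed via salt-pillars-openqa, not a hand-edited file):

```
# /etc/openqa/workers.ini (illustrative fragment; generated from salt
# pillars on OSD). A worker refuses new jobs above this load average.
[global]
CRITICAL_LOAD_AVG_THRESHOLD = 25
```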

Next idea: restrict "qemu_x86_64-large-mem" to a limited number of worker instances, or introduce a new worker class and ask the SAP-HA squad to schedule their tests against those instances specifically.
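A sketch of how restricting the class could look in workers.ini; the instance numbers are illustrative, and on OSD the sections would again be generated from the salt pillars:

```
# Only the first two instances advertise the large-mem class; jobs with
# WORKER_CLASS=qemu_x86_64-large-mem would then be confined to them.
[1]
WORKER_CLASS = qemu_x86_64,qemu_x86_64-large-mem
[2]
WORKER_CLASS = qemu_x86_64,qemu_x86_64-large-mem
[3]
WORKER_CLASS = qemu_x86_64
```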

