action #162602: [FIRING:1] worker40 (worker40: CPU load alert openQA worker40 salt cpu_load_alert_worker40 worker) size:S - openQA Infrastructure (public) - openSUSE Project Management Tool

action #162602

closed

openQA Project (public) - coordination #112862: [saga][epic] Future ideas for easy multi-machine handling: MM-tests as first-class citizens

openQA Project (public) - coordination #111929: [epic] Stable multi-machine tests covering multiple physical workers

[FIRING:1] worker40 (worker40: CPU load alert openQA worker40 salt cpu_load_alert_worker40 worker) size:S

Added by okurz 6 months ago. Updated 5 months ago.

Status: Resolved
Priority: Normal
Assignee:
Category: Regressions/Crashes
Target version: openQA Project (public) - Ready
Start date: 2024-06-20
Due date:
% Done: 0%
Estimated time:
Tags: alert, osd, infra, worker40

Description

Observation

Since #162374, w40 (worker40.oqa.prg2.suse.org) is the only OSD PRG2 x86_64 tap worker, and because of the openQA job queue size it is executing openQA jobs near-continuously. Two alerts have now triggered: one about too high CPU load and one about a partition getting full. This is similar to #162596.

Suggestions

Maybe the high CPU load was caused by the lack of space - which is tracked in #162596
Are tests passing successfully on worker40? - If it doesn't look like we have typing or similar issues, bump the alert threshold.
Lower the load limit
Check the number of worker slots and, if needed, reduce it according to the load. Maybe we did not notice that the capacity was already too high before.
Take #162596 into account

Rollback actions

Remove the silence for alerts matching rule_uid=~load_alert_worker40 from https://monitor.qa.suse.de/alerting/silences

Related issues 3 (0 open — 3 closed)

#1

Updated by okurz 6 months ago

Copied from action #162596: [FIRING:1] worker40 (worker40: partitions usage (%) alert openQA partitions_usage_alert_worker40 worker) added

#2

Updated by okurz 6 months ago

Copied to action #162605: [FIRING:1] CPU load alert, should be "system load" added

#3

Updated by livdywan 6 months ago

Subject changed from [FIRING:1] worker40 (worker40: CPU load alert openQA worker40 salt cpu_load_alert_worker40 worker) to [FIRING:1] worker40 (worker40: CPU load alert openQA worker40 salt cpu_load_alert_worker40 worker) size:S
Description updated (diff)
Status changed from New to Workable

#4

Updated by okurz 6 months ago

Status changed from Workable to In Progress
Assignee set to okurz

#5

Updated by okurz 6 months ago

Description updated (diff)

https://gitlab.suse.de/openqa/salt-pillars-openqa/-/merge_requests/845

#6

Updated by okurz 6 months ago

https://gitlab.suse.de/openqa/salt-pillars-openqa/-/merge_requests/845 merged

#7

Updated by livdywan 6 months ago

okurz wrote in #note-6:

https://gitlab.suse.de/openqa/salt-pillars-openqa/-/merge_requests/845 merged

So what's the next step? Monitor CPU load for the afternoon?

#8

Updated by okurz 6 months ago

Status changed from In Progress to Resolved

The limit is effective and the rollback action is done. No alert is firing right now. I will be notified if the alert triggers again.

#9

Updated by okurz 6 months ago

Status changed from Resolved to In Progress

Apparently that was not enough. I need to silence the alerts and see what to do about it.

#10

Updated by openqa_review 6 months ago

Due date set to 2024-07-07

Setting due date based on mean cycle time of SUSE QE Tools

#11

Updated by okurz 6 months ago

Priority changed from Urgent to High

Silenced alert again for now.

I looked at https://monitor.qa.suse.de/d/WDworker40/worker-dashboard-worker40?orgId=1&from=now-7d&to=now
and found short load spikes that coincide with significant but non-critical memory usage, low CPU usage, and high I/O usage with I/O times maxing out. As observed over the past days, during those times there seem to be especially demanding openQA jobs with bigger HDDSIZE requests and correspondingly high I/O demands. I assume that we are not actually hitting typing issues in such cases, but that test execution still stalls and triggers the observed alerts.
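
A high load average together with low CPU usage is plausible on Linux, because the load average also counts tasks in uninterruptible sleep, which typically means waiting for I/O. As a rough illustration, the following sketch labels a single monitoring sample. The `Sample` type, the `classify` function and all thresholds are hypothetical and chosen for illustration only; they are not taken from the dashboard or from the alert rules.

```python
from dataclasses import dataclass


@dataclass
class Sample:
    load1: float         # 1-minute system load average
    cpu_busy_pct: float  # CPU time spent in user+system, in percent
    iowait_pct: float    # CPU time spent waiting for I/O, in percent


def classify(sample: Sample,
             load_limit: float = 25.0,       # assumed alerting threshold
             cpu_busy_limit: float = 50.0,   # assumed "CPU is busy" cutoff
             iowait_limit: float = 20.0) -> str:  # assumed "I/O is busy" cutoff
    """Return a rough label for what drives the load in one sample."""
    if sample.load1 <= load_limit:
        return "ok"
    if sample.cpu_busy_pct >= cpu_busy_limit:
        return "cpu-bound"
    if sample.iowait_pct >= iowait_limit:
        return "io-bound"
    return "unclear"
```

A spike like the ones described above (high load, low CPU usage, high I/O wait) would be labelled "io-bound" by this sketch, which would support treating the alert as an I/O capacity problem instead of a CPU one.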

So I proposed
https://gitlab.suse.de/openqa/salt-pillars-openqa/-/merge_requests/848
to reduce the worker load limit from 30 to 25.
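
To illustrate the effect of lowering the limit, here is a minimal sketch. It assumes that the worker load limit is a threshold on the system load average at or above which a worker does not pick up new jobs. The function name `may_start_job` and the sample load values are hypothetical; this is not openQA's actual implementation.

```python
def may_start_job(load_avg: float, load_limit: float) -> bool:
    """Allow a new job only while the load average is below the limit."""
    return load_avg < load_limit


# Hypothetical load samples, evaluated against the old and the new limit.
samples = [12.0, 27.5, 31.0]
with_old_limit = [may_start_job(s, 30.0) for s in samples]
with_new_limit = [may_start_job(s, 25.0) for s in samples]
```

With these made-up samples, a load of 27.5 would still allow new jobs under the limit of 30 but not under 25, so the lower limit makes the worker back off earlier during a spike.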

Next idea: Limit the "qemu_x86_64-large-mem" worker class to a limited number of instances, or introduce a new class and ask the SAP-HA squad to schedule tests against those instances in particular.
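
The idea of restricting a class to a few instances could look roughly like the sketch below. It assumes that each worker slot advertises a list of worker classes and that only the first few slots get the large-memory class. The function `assign_worker_classes`, the base class name `qemu_x86_64` and the slot counts are assumptions for illustration, not the actual salt pillar layout.

```python
def assign_worker_classes(num_slots: int,
                          large_mem_slots: int,
                          base_class: str = "qemu_x86_64",  # assumed base class
                          large_class: str = "qemu_x86_64-large-mem"):
    """Map slot numbers (1-based) to the worker classes they advertise."""
    if not 0 <= large_mem_slots <= num_slots:
        raise ValueError("large_mem_slots must be between 0 and num_slots")
    classes = {}
    for slot in range(1, num_slots + 1):
        slot_classes = [base_class]
        # Only the first `large_mem_slots` slots accept large-memory jobs.
        if slot <= large_mem_slots:
            slot_classes.append(large_class)
        classes[slot] = slot_classes
    return classes


# Hypothetical example: 4 slots, of which 2 accept large-memory jobs.
example = assign_worker_classes(num_slots=4, large_mem_slots=2)
```

Jobs that request the large-memory class could then only run on the selected slots, which caps how many of the demanding jobs run on the host at the same time.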

#12

Updated by okurz 6 months ago

https://gitlab.suse.de/openqa/salt-pillars-openqa/-/merge_requests/850

#13

Updated by okurz 6 months ago

https://gitlab.suse.de/openqa/salt-pillars-openqa/-/merge_requests/853

#14

Updated by okurz 6 months ago

Status changed from In Progress to Feedback

Waiting for review and feedback on https://gitlab.suse.de/openqa/salt-pillars-openqa/-/merge_requests/853

#15

Updated by okurz 6 months ago

Related to action #162719: Ensure w40 has more space for worker pool directories size:S added

#16

Updated by okurz 6 months ago

Due date deleted (2024-07-07)
Status changed from Feedback to Blocked
Priority changed from High to Normal

The proposal for "size" classes stands, and I would like to give it more time to decide whether that is the right approach. There is also the impact of space depletion, which coincided with the too-high load, so I am blocking this on #162719.

#17

Updated by okurz 5 months ago

Status changed from Blocked to Resolved

I kept https://gitlab.suse.de/openqa/salt-pillars-openqa/-/merge_requests/853 open because I think it is a valid idea, but it is still undecided how maintainable it would be. Besides that, the alert is gone, so I removed the silence.
