action #162602
openQA Project - coordination #112862: [saga][epic] Future ideas for easy multi-machine handling: MM-tests as first-class citizens
openQA Project - coordination #111929: [epic] Stable multi-machine tests covering multiple physical workers
[FIRING:1] worker40 (worker40: CPU load alert openQA worker40 salt cpu_load_alert_worker40 worker) size:S
0%
Description
Observation
With #162374, w40 (worker40.oqa.prg2.suse.org) is the only OSD PRG2 x86_64 tap worker and, due to the openQA job queue size, w40 is executing openQA jobs near-continuously. Now one alert triggered about too high CPU load and another about a partition getting full. Similar to #162596
Suggestions
- Maybe the high CPU load was caused by the lack of space - which is tracked in #162596
- Are tests passing successfully on worker40? - If it doesn't look like we have typing or similar issues, bump the alert threshold.
- Lower the load limit
- Check the number of worker slots and e.g. reduce according to the load - maybe we didn't notice the capacity was already too high before
- Take #162596 into account
Rollback actions
- Remove alert rule_uid=~load_alert_worker40 from https://monitor.qa.suse.de/alerting/silences
Updated by okurz 10 days ago
- Copied from action #162596: [FIRING:1] worker40 (worker40: partitions usage (%) alert openQA partitions_usage_alert_worker40 worker) auto_review:"No space left on device":retry added
Updated by okurz 10 days ago
- Copied to action #162605: [FIRING:1] CPU load alert, should be "system load" added
Updated by livdywan 10 days ago
- Subject changed from [FIRING:1] worker40 (worker40: CPU load alert openQA worker40 salt cpu_load_alert_worker40 worker) to [FIRING:1] worker40 (worker40: CPU load alert openQA worker40 salt cpu_load_alert_worker40 worker) size:S
- Description updated (diff)
- Status changed from New to Workable
Updated by openqa_review 7 days ago
- Due date set to 2024-07-07
Setting due date based on mean cycle time of SUSE QE Tools
Updated by okurz 6 days ago
- Priority changed from Urgent to High
Silenced alert again for now.
I looked at https://monitor.qa.suse.de/d/WDworker40/worker-dashboard-worker40?orgId=1&from=now-7d&to=now
and found that there are short-lived load spikes coinciding with significant but non-critical memory usage, low CPU usage, and high I/O usage with I/O times maxing out. As observed over the past days, during those times there seem to be especially demanding openQA jobs with bigger HDDSIZE requests and correspondingly high I/O demands. I assume that we are not actually hitting typing issues in such cases but still stall test execution and trigger the observed alerts.
So I proposed
https://gitlab.suse.de/openqa/salt-pillars-openqa/-/merge_requests/848
to reduce worker load limit 30->25.
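For illustration only, a minimal sketch of what such a load limit does, assuming the worker simply compares the host's 1-minute load average against the limit before picking up a new job. The function names here are hypothetical and this is not the actual openQA worker code. On Linux the load average also counts tasks blocked on I/O, which would fit spikes with low CPU usage but maxed-out I/O times.

```python
import os

# Limits from the proposed change (30 -> 25).
OLD_LOAD_LIMIT = 30.0
NEW_LOAD_LIMIT = 25.0


def current_load() -> float:
    """Return the 1-minute system load average of this host."""
    return os.getloadavg()[0]


def may_pick_up_job(load_avg_1min: float, load_limit: float) -> bool:
    """Hypothetical check: only accept a new job while load is below the limit."""
    return load_avg_1min < load_limit


if __name__ == "__main__":
    load = current_load()
    for limit in (OLD_LOAD_LIMIT, NEW_LOAD_LIMIT):
        print(f"load={load:.2f} limit={limit}: accept={may_pick_up_job(load, limit)}")
```

With a load of 27, for example, the old limit would still let the worker accept a job while the new one would make it wait until the load drops.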
Next idea: Limit the worker class "qemu_x86_64-large-mem" to a limited number of instances, or introduce a new class and ask the SAP-HA squad to schedule tests against those in particular.