action #162602
openQA Project (public) - coordination #112862 (closed): [saga][epic] Future ideas for easy multi-machine handling: MM-tests as first-class citizens
openQA Project (public) - coordination #111929: [epic] Stable multi-machine tests covering multiple physical workers
[FIRING:1] worker40 (worker40: CPU load alert openQA worker40 salt cpu_load_alert_worker40 worker) size:S
Description
Observation
With #162374 w40 (worker40.oqa.prg2.suse.org) is the only OSD PRG2 x86_64 tap worker, and due to the openQA job queue size w40 is executing openQA jobs near-continuously. Now an alert about too high CPU load has triggered, along with one about a partition filling up. Similar to #162596
Suggestions
- Maybe the high CPU load was caused by the lack of disk space, which is tracked in #162596
- Are tests passing on worker40? If it doesn't look like we have typing or similar issues, bump the alert threshold
- Lower the load limit
- Check the number of worker slots and, e.g., reduce it according to the load - maybe we didn't notice earlier that the capacity was already too high
- Take #162596 into account
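A quick way to sanity-check the current load against a per-host limit is a small shell snippet. This is a hedged sketch: the limit value is an arbitrary example, and the sample load value stands in for reading /proc/loadavg on the actual worker.

```shell
# Sketch: compare the 1-minute load against a per-host limit.
# LIMIT=25 is an example value; on a real worker read the load with
#   load=$(cut -d' ' -f1 /proc/loadavg)
LIMIT=25
load=31.4   # sample value for illustration
msg=$(awk -v l="$load" -v max="$LIMIT" \
  'BEGIN { printf "load %s %s limit %d", l, (l > max ? "exceeds" : "within"), max }')
echo "$msg"
```

If the load repeatedly exceeds the limit while tests still pass, that supports bumping the alert threshold rather than reducing slots.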
Rollback actions
- Remove the silence rule_uid=~load_alert_worker40 from https://monitor.qa.suse.de/alerting/silences
Updated by okurz 6 months ago
- Copied from action #162596: [FIRING:1] worker40 (worker40: partitions usage (%) alert openQA partitions_usage_alert_worker40 worker) added
Updated by okurz 6 months ago
- Copied to action #162605: [FIRING:1] CPU load alert, should be "system load" added
Updated by livdywan 6 months ago
- Subject changed from [FIRING:1] worker40 (worker40: CPU load alert openQA worker40 salt cpu_load_alert_worker40 worker) to [FIRING:1] worker40 (worker40: CPU load alert openQA worker40 salt cpu_load_alert_worker40 worker) size:S
- Description updated (diff)
- Status changed from New to Workable
Updated by openqa_review 6 months ago
- Due date set to 2024-07-07
Setting due date based on mean cycle time of SUSE QE Tools
Updated by okurz 6 months ago
- Priority changed from Urgent to High
Silenced the alert again for now.
I looked at https://monitor.qa.suse.de/d/WDworker40/worker-dashboard-worker40?orgId=1&from=now-7d&to=now
and found short-lived load spikes coinciding with significant but non-critical memory usage, low CPU usage, and high I/O usage with maxed-out I/O times. As observed over the past days, during those times particularly demanding openQA jobs with bigger HDDSIZE requests and correspondingly high I/O demands seem to be running. I assume we are not actually hitting typing issues in such cases, but test execution still stalls and triggers the observed alerts.
So I proposed https://gitlab.suse.de/openqa/salt-pillars-openqa/-/merge_requests/848 to reduce the worker load limit from 30 to 25.
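A minimal sketch of what such a pillar change could look like. The key names here are assumptions for illustration; the actual schema lives in salt-pillars-openqa and may differ.

```yaml
# Hypothetical pillar fragment: lower the load limit for worker40 so the
# worker stops accepting new jobs earlier under I/O pressure.
# Key names are assumptions; check the real schema in salt-pillars-openqa.
workerconf:
  worker40:
    load_limit: 25  # previously 30
```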
Next idea: Restrict "qemu_x86_64-large-mem" jobs to a limited number of worker instances, or introduce a new worker class and ask the SAP-HA squad to schedule their tests against it specifically.
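On the worker side this could be implemented via worker classes, roughly as sketched below for /etc/openqa/workers.ini. The particular class split is an assumption, and on OSD this file is generated from salt-pillars-openqa rather than edited directly.

```ini
# Sketch: dedicate only a couple of slots to large-memory jobs by giving
# them an extra worker class; the remaining slots keep the defaults.
[global]
WORKER_CLASS = qemu_x86_64,tap

[1]
WORKER_CLASS = qemu_x86_64,qemu_x86_64-large-mem,tap

[2]
WORKER_CLASS = qemu_x86_64,qemu_x86_64-large-mem,tap
```

This would cap how many high-HDDSIZE jobs can run concurrently and so bound their combined I/O demand.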
Updated by okurz 6 months ago
- Status changed from In Progress to Feedback
waiting for review and feedback on https://gitlab.suse.de/openqa/salt-pillars-openqa/-/merge_requests/853
Updated by okurz 6 months ago
- Related to action #162719: Ensure w40 has more space for worker pool directories size:S added
Updated by okurz 6 months ago
- Due date deleted (2024-07-07)
- Status changed from Feedback to Blocked
- Priority changed from High to Normal
The proposal for "size" classes stands, and I would like to give it more time to decide whether that is the right approach. There is also the impact of space depletion, which coincided with the too-high load, so blocking on #162719
Updated by okurz 5 months ago
- Status changed from Blocked to Resolved
I kept https://gitlab.suse.de/openqa/salt-pillars-openqa/-/merge_requests/853 open because I think it's a valid idea, but it's still undecided how maintainable it would be. Besides that, the alert is gone, so I removed the silence accordingly.