action #162602
openQA Project - coordination #112862: [saga][epic] Future ideas for easy multi-machine handling: MM-tests as first-class citizens
openQA Project - coordination #111929: [epic] Stable multi-machine tests covering multiple physical workers
[FIRING:1] worker40 (worker40: CPU load alert openQA worker40 salt cpu_load_alert_worker40 worker) size:S
Status:
In Progress
Priority:
High
Assignee:
Category:
Regressions/Crashes
Target version:
Start date:
2024-06-20
Due date:
2024-07-07 (Due in 7 days)
% Done:
0%
Estimated time:
Description
Observation
With #162374, w40 (worker40.oqa.prg2.suse.org) is the only OSD PRG2 x86_64 tap worker, and because of the size of the openQA job queue it is executing openQA jobs near-continuously. Now one alert has triggered about too high CPU load and another about a partition getting full. Similar to #162596.
Suggestions
- Maybe the high CPU load was caused by the lack of space - which is tracked in #162596
- Are tests passing successfully on worker40? - If it doesn't look like we have typing or similar issues, bump the alert threshold.
- Lower the load limit
- Check the number of worker slots and, if needed, reduce it according to the load - maybe the capacity was already too high before and we didn't notice
- Take #162596 into account
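As a rough aid for the worker slot suggestion above, here is a minimal sketch of how a reduced slot count could be estimated from the observed load. It assumes that load scales roughly linearly with the number of busy worker slots, which is a simplification. The function name `suggested_slots`, the example slot count and the choice of target load are hypothetical and not taken from the worker40 configuration.

```python
import os


def suggested_slots(load_1min, current_slots, target_load):
    """Estimate a slot count that would bring the load down to target_load.

    Assumes load is proportional to the number of busy slots (a simplification).
    Never suggests more slots than currently configured, and never fewer than 1.
    """
    if load_1min <= 0 or current_slots <= 0:
        return current_slots
    load_per_slot = load_1min / current_slots
    return max(1, min(current_slots, int(target_load / load_per_slot)))


if __name__ == "__main__":
    # Hypothetical inputs: use the core count as target load and assume 40 slots.
    cores = os.cpu_count() or 1
    load_1min, _, _ = os.getloadavg()  # available on Unix-like systems
    print(suggested_slots(load_1min, current_slots=40, target_load=float(cores)))
```

The result is only a starting point. Whether tests still pass reliably on the worker, as asked above, remains the more important signal.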
Rollback actions
- Remove alert rule_uid=~load_alert_worker40 from https://monitor.qa.suse.de/alerting/silences