action #150983
closed
CPU Load and usage alert for openQA workers size:S
Description
Motivation
In #139271 I conducted a load test on mania. With intentionally too many worker instances configured, openQA tests became flaky, showing lost or repeated characters in VNC typing. Monitoring showed that during those times the CPU load was consistently very high, e.g. over 40, for more than 1h. CPU usage was accordingly maxed out for an extended time. We should define alerts to prevent such situations from going unnoticed.
Acceptance criteria
- AC1: openQA OSD workers alert if CPU load and/or usage exceed limits that go along with flaky tests due to overload
Suggestions
- Just add an alert: we already have monitoring panels for CPU usage and load for each worker, just no alert. We do already have an alert for the OSD webUI which we can use as reference.
- Add a description for the alert, e.g. based on the above motivation. And link this ticket.
- Ensure that the alert is deployed and active for all OSD workers
- Ensure that there are no related alerts firing, i.e. either there are no overload situations and hence no alerts, or machines are legitimately overloaded and should be handled accordingly.
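To make the intended alert condition concrete, here is a minimal Python sketch of a "sustained overload" check. It is not the deployed alert definition. The function name `sustained_overload`, the sample format, and the example threshold of 40 over 3600 seconds are assumptions for illustration, loosely taken from the observation in the motivation above. The real limits and evaluation window would be set in the alert rule itself.

```python
from typing import Iterable, Tuple


def sustained_overload(
    samples: Iterable[Tuple[float, float]],
    threshold: float,
    min_duration_s: float,
) -> bool:
    """Return True if the load stays above `threshold` for a contiguous
    span of at least `min_duration_s` seconds.

    `samples` is a hypothetical series of (timestamp_seconds, load) pairs
    sorted by timestamp, e.g. as exported from a monitoring database.
    """
    run_start = None  # timestamp at which the current overload run began
    for timestamp, load in samples:
        if load > threshold:
            if run_start is None:
                run_start = timestamp
            if timestamp - run_start >= min_duration_s:
                return True
        else:
            # Any sample at or below the threshold resets the run.
            run_start = None
    return False


# Example: one sample per minute, load 45 for a full hour.
steady = [(minute * 60, 45.0) for minute in range(61)]
print(sustained_overload(steady, threshold=40.0, min_duration_s=3600))
```

Requiring a contiguous run rather than a single spike mirrors the situation described in the motivation, where the load was high for a prolonged time, and avoids alerting on short bursts.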
Updated by okurz 11 months ago
- Copied from action #139271: Repurpose PowerPC hardware in FC Basement - mania Power8 PowerPC size:M added
Updated by okurz 11 months ago
https://gitlab.suse.de/openqa/salt-states-openqa/-/merge_requests/1044 was merged; no related alert showed up. On monitor.qe.nue2.suse.org we also noticed that the salt minion had not been running for some days.
First, mkittler created https://gitlab.suse.de/openqa/salt-states-openqa/-/merge_requests/1052 to give a worker-specific uid for the alert.
Updated by okurz 11 months ago
- Copied to action #151588: [potential-regression] Our salt node up check in osd-deployment never fails size:M added
Updated by okurz 11 months ago
- Due date deleted (2023-12-01)
- Status changed from Feedback to Resolved
https://monitor.qa.suse.de/d/WDpetrol/worker-dashboard-petrol?viewPanel=54694&orgId=1&from=now-6h&to=now shows the alert. Same on other workers.