action #150983

Updated by okurz 6 months ago

## Motivation 
 In #139271 I conducted a load test on mania. With intentionally too many worker instances configured, openQA tests became flaky, showing lost or repeated characters in VNC typing. Monitoring showed that during those times the CPU load was consistently very high, e.g. over 40, for more than 1h. CPU usage was correspondingly maxed out for a prolonged time. We should define alerts to prevent such situations from going unnoticed. 

 ## Acceptance criteria 
 * **AC1:** openQA OSD workers alert if CPU load and/or usage exceeds limits that are associated with flaky tests due to overload 

 ## Suggestions 
 * Just add an alert: we already have monitoring panels for CPU usage and load for each worker, just no alert (but we already have an alert for the OSD webUI which we can use as a reference) 
 * Add a description for the alert, e.g. based on the above motivation, and link this ticket 
 * Ensure that the alert is deployed and active for all OSD workers 
 * Ensure that there are no related alerts firing, i.e. either there are no overload situations and hence no alerts, or machines are legitimately overloaded and should be handled accordingly
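
To illustrate the kind of condition such an alert could evaluate, here is a minimal Python sketch of a "sustained overload" check. It is not the actual alert definition: the function name `sustained_overload`, the sample format and the example limits are assumptions for illustration only. The values 40 and 1h merely mirror the observation from the motivation and are not proposed alert limits.

```python
from typing import Iterable, Tuple


def sustained_overload(
    samples: Iterable[Tuple[float, float]],
    load_limit: float,
    min_duration_s: float,
) -> bool:
    """Return True if the load stayed above load_limit for at least
    min_duration_s seconds without interruption.

    samples: (timestamp_in_seconds, load_average) pairs in ascending
    time order. This format is hypothetical, not taken from the real
    monitoring setup.
    """
    run_start = None  # timestamp where the current overload run began
    for timestamp, load in samples:
        if load > load_limit:
            if run_start is None:
                run_start = timestamp
            if timestamp - run_start >= min_duration_s:
                return True
        else:
            # Load dropped back to or below the limit: reset the run.
            run_start = None
    return False


# One sample per minute for a bit over an hour, load constantly at 45.
overloaded = [(minute * 60, 45.0) for minute in range(61)]
# Same series, but the load briefly drops to 10 after 30 minutes.
with_dip = [(t, 10.0 if t == 1800 else load) for t, load in overloaded]

print(sustained_overload(overloaded, load_limit=40, min_duration_s=3600))
print(sustained_overload(with_dip, load_limit=40, min_duration_s=3600))
```

Requiring a minimum duration instead of alerting on a single sample is meant to avoid alerts on short load spikes. Suitable limits and durations would still need to be derived from the monitoring data of the OSD workers.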
