action #150983

CPU Load and usage alert for openQA workers size:S

Added by okurz about 1 year ago. Updated about 1 year ago.

Status: Resolved
Priority: Normal
Assignee:
Category: -
Start date:
Due date:
% Done: 0%
Estimated time:

Description

Motivation

In #139271 I conducted a load test on mania. With intentionally too many worker instances configured, openQA tests became flaky, showing lost or repeated characters in VNC typing. Monitoring showed that during those times the CPU load was consistently very high, e.g. over 40, for more than 1h, and CPU usage was correspondingly maxed out for an extended period. We should define alerts so that such situations do not go unnoticed.

Acceptance criteria

  • AC1: openQA OSD workers trigger an alert if CPU load and/or usage exceed limits associated with flaky tests due to overload

Suggestions

  • Just add an alert: we already have monitoring panels for CPU usage and load for each worker, just no alert (the existing alert for the OSD webUI can serve as a reference)
  • Add a description for the alert, e.g. based on the above motivation. And link this ticket.
  • Ensure that the alert is deployed and active for all OSD workers
  • Ensure that there are no related alerts firing, i.e. either there are no overload situations and hence no alerts, or machines are legitimately overloaded and should be handled accordingly.
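
The alert condition described above (load staying above a limit for a sustained period) could be sketched as follows. This is only an illustration of the logic, not the actual alert definition used on OSD: the function name `sustained_overload`, the sample format, and the example values (load 40, 1h, taken from the observation in the motivation rather than from agreed limits) are all assumptions.

```python
from typing import Iterable, Tuple


def sustained_overload(
    samples: Iterable[Tuple[float, float]],
    threshold: float,
    min_duration_s: float,
) -> bool:
    """Return True if the load stays above `threshold` for at least
    `min_duration_s` seconds without interruption.

    `samples` is a hypothetical, time-ordered sequence of
    (timestamp_in_seconds, load_average) pairs.
    """
    overload_start = None  # timestamp at which the current overload streak began
    for timestamp, load in samples:
        if load > threshold:
            if overload_start is None:
                overload_start = timestamp
            if timestamp - overload_start >= min_duration_s:
                return True
        else:
            # Load dropped back to or below the threshold: reset the streak.
            overload_start = None
    return False


# Illustrative values only: one sample per minute over one hour.
always_high = [(t * 60, 45.0) for t in range(61)]
with_dip = [(t * 60, 45.0 if t != 30 else 5.0) for t in range(61)]

print(sustained_overload(always_high, threshold=40, min_duration_s=3600))
print(sustained_overload(with_dip, threshold=40, min_duration_s=3600))
```

Requiring a sustained period rather than a single sample above the limit is meant to avoid alerts on short load spikes that do not correlate with flaky tests.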

Related issues 2 (0 open, 2 closed)

Copied from openQA Infrastructure (public) - action #139271: Repurpose PowerPC hardware in FC Basement - mania Power8 PowerPC size:M (Resolved, okurz, 2023-09-20)

Copied to openQA Infrastructure (public) - action #151588: [potential-regression] Our salt node up check in osd-deployment never fails size:M (Rejected, okurz, 2023-11-28)
