action #150983

closed

CPU Load and usage alert for openQA workers size:S

Added by okurz about 1 year ago. Updated about 1 year ago.

Status: Resolved
Priority: Normal
Assignee:
Category: -
Start date:
Due date:
% Done: 0%
Estimated time:

Description

Motivation

In #139271 I conducted a load test on mania. With intentionally too many worker instances configured, openQA tests became flaky, showing lost or repeated characters in VNC typing. Monitoring showed that during those times the CPU load was consistently very high, e.g. over 40, for more than 1h. CPU usage was accordingly maxed out for an extended time as well. We should define alerts to prevent such situations from going unnoticed.

Acceptance criteria

  • AC1: openQA OSD workers alert if CPU load and/or usage exceed limits that go along with flaky tests due to overload

Suggestions

  • Just add an alert: we already have monitoring panels for CPU usage and load for each worker, just no alert. We already have an alert for the OSD webUI which we can use as a reference.
  • Add a description for the alert, e.g. based on the motivation above, and link this ticket.
  • Ensure that the alert is deployed and active for all OSD workers.
  • Ensure that there are no related alerts firing, i.e. either there are no overload situations and hence no alerts, or machines are legitimately overloaded and should be handled accordingly.
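
The alert condition suggested above can be sketched as "load stays above a limit for a sustained period". The following Python snippet is only an illustration of that logic, not the rule from the merged alert definition: the function name `sustained_overload`, the sample format and the example values (a load limit of 40 and a duration of 1h, taken from the motivation) are assumptions made for this sketch.

```python
from typing import Iterable, Optional, Tuple


def sustained_overload(
    samples: Iterable[Tuple[float, float]],
    threshold: float,
    min_duration_s: float,
) -> bool:
    """Return True if load stayed above `threshold` for at least
    `min_duration_s` seconds without interruption.

    `samples` is a time-ordered sequence of (timestamp_seconds, load)
    pairs. This format is hypothetical and chosen for illustration only.
    """
    start: Optional[float] = None  # start of the current overload streak
    for timestamp, load in samples:
        if load > threshold:
            if start is None:
                start = timestamp
            if timestamp - start >= min_duration_s:
                return True
        else:
            # Load dropped to or below the limit, so the streak is broken.
            start = None
    return False
```

Called with samples taken every 10 minutes that all show a load of 45, `sustained_overload(samples, threshold=40.0, min_duration_s=3600)` would report an overload once the streak spans a full hour, whereas a single dip below the limit resets the streak. Requiring a sustained period rather than a single spike is meant to avoid alerts on short, harmless load peaks.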

Related issues 2 (0 open, 2 closed)

Copied from openQA Infrastructure (public) - action #139271: Repurpose PowerPC hardware in FC Basement - mania Power8 PowerPC size:M (Resolved, okurz, 2023-09-20)

Copied to openQA Infrastructure (public) - action #151588: [potential-regression] Our salt node up check in osd-deployment never fails size:M (Rejected, okurz, 2023-11-28)

Actions #1

Updated by okurz about 1 year ago

  • Copied from action #139271: Repurpose PowerPC hardware in FC Basement - mania Power8 PowerPC size:M added
Actions #2

Updated by okurz about 1 year ago

  • Status changed from New to In Progress
  • Assignee set to okurz
  • Target version changed from future to Ready
Actions #3

Updated by okurz about 1 year ago

  • Due date set to 2023-12-01
  • Status changed from In Progress to Feedback
Actions #4

Updated by okurz about 1 year ago

  • Subject changed from CPU Load and usage alert for openQA workers to CPU Load and usage alert for openQA workers size:S
  • Description updated (diff)
Actions #5

Updated by okurz about 1 year ago

https://gitlab.suse.de/openqa/salt-states-openqa/-/merge_requests/1044 was merged; no related alert showed up. On monitor.qe.nue2.suse.org we also noticed that the salt minion had not been running for some days.

First, mkittler created https://gitlab.suse.de/openqa/salt-states-openqa/-/merge_requests/1052 to give the alert a worker-specific uid.

Actions #6

Updated by okurz about 1 year ago

  • Copied to action #151588: [potential-regression] Our salt node up check in osd-deployment never fails size:M added
Actions #7

Updated by okurz about 1 year ago

  • Due date deleted (2023-12-01)
  • Status changed from Feedback to Resolved