action #158113

closed

openQA Project (public) - coordination #110833: [saga][epic] Scale up: openQA can handle a schedule of 100k jobs with 1k worker instances

openQA Project (public) - coordination #158110: [epic] Prevent worker overload

typing issue on ppc64 worker - make CPU load alert more strict size:M

Added by okurz 9 months ago. Updated 8 months ago.

Status:
Resolved
Priority:
Low
Assignee:
Category:
Regressions/Crashes
Start date:
2024-03-27
Due date:
% Done:

0%

Estimated time:
Tags:

Description

Motivation

#158104 shows VNC typing issues. To catch situations like this, we deliberately added alerts for too high CPU load in #150983. https://monitor.qa.suse.de/d/WDmania/worker-dashboard-mania?orgId=1&from=now-2d&to=now&viewPanel=54694 clearly shows a load consistently in the range of 50-70(!) for mania, but no alert was triggered. We should cross-check https://monitor.qa.suse.de/alerting/cpu_load_alert_mania/modify-export?returnTo=%2Fd%2FWDmania%2Fworker-dashboard-mania%3ForgId%3D1%26from%3Dnow-7d%26to%3Dnow%26viewPanel%3D54694%26editPanel%3D54694%26tab%3Dalert and make that alert stricter.

Acceptance criteria

  • AC1: CPU load alerts trigger for a CPU load15 consistently above 40 as originally planned
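
One way to read AC1, as a minimal Python sketch: the alert fires only when every load15 sample in an evaluation window exceeds the threshold. The function name, the window representation, and the sample values are hypothetical. The real alert is defined in Grafana and may be evaluated differently (for example on an average over the window).

```python
def load15_alert(samples, threshold=40.0):
    """Return True if all load15 samples in the window exceed the threshold.

    Hypothetical helper for illustration; not the real Grafana rule.
    """
    # An empty window carries no evidence of overload, so do not alert.
    if not samples:
        return False
    return all(value > threshold for value in samples)


# Load in the 50-70 range as reported for mania: the alert should fire.
overloaded = [52.0, 61.5, 68.0, 55.3]
# A single short spike is not "consistently above 40": no alert.
spike = [12.0, 45.0, 14.0, 11.0]

print(load15_alert(overloaded))
print(load15_alert(spike))
```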

Suggestions


Related issues 1 (0 open, 1 closed)

Copied from openQA Infrastructure (public) - action #158104: typing issue on ppc64 worker size:S (Resolved, okurz, 2024-03-27)

Actions #1

Updated by okurz 9 months ago

  • Copied from action #158104: typing issue on ppc64 worker size:S added
Actions #2

Updated by okurz 9 months ago

  • Subject changed from typing issue on ppc64 worker to typing issue on ppc64 worker - make CPU load alert more strict
Actions #3

Updated by okurz 9 months ago

  • Status changed from New to In Progress
  • Assignee set to okurz
Actions #4

Updated by okurz 9 months ago

  • Due date set to 2024-04-10
  • Status changed from In Progress to Feedback
Actions #5

Updated by okurz 9 months ago

Actions #6

Updated by okurz 9 months ago · Edited

Got help: https://gitlab.suse.de/openqa/salt-states-openqa/-/merge_requests/1137 makes the alert evaluation ignore NaN. More days will be needed to monitor the impact.

https://monitor.qa.suse.de/d/WDmania/worker-dashboard-mania?viewPanel=54694&orgId=1&from=1711742324943&to=1711831309898 appears to show a retrospective evaluation of the alert rule, indicating that alerts would indeed have been triggered. Let's see whether more systems actually trigger alerts in the coming days.
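
The merge request itself is not quoted in this ticket, so the following is not its actual change. It is only a small Python sketch of one plausible way NaN samples can mask an overload: a single NaN turns an average into NaN, and any comparison against NaN is false, so a threshold check never fires. The function name and the sample window are hypothetical.

```python
import math


def mean_load(samples, ignore_nan=False):
    """Average the samples, optionally dropping NaN values first."""
    if ignore_nan:
        samples = [v for v in samples if not math.isnan(v)]
    if not samples:
        return math.nan
    return sum(samples) / len(samples)


THRESHOLD = 40.0  # threshold from AC1
# Hypothetical window: high load with one missing (NaN) data point.
window = [55.0, math.nan, 65.0, 60.0]

naive = mean_load(window)
filtered = mean_load(window, ignore_nan=True)

# Any comparison with NaN is False, so the naive check stays silent.
print(naive > THRESHOLD)
# With NaN dropped, the remaining samples average above the threshold.
print(filtered > THRESHOLD)
```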

Actions #7

Updated by okurz 9 months ago

  • Due date deleted (2024-04-10)
  • Status changed from Feedback to Resolved

No alerts were received for a day, but this is considered good. The next overload will very likely trigger alerts.

Actions #8

Updated by okurz 9 months ago

  • Status changed from Resolved to New
  • Assignee deleted (okurz)
Actions #9

Updated by okurz 9 months ago

  • Subject changed from typing issue on ppc64 worker - make CPU load alert more strict to typing issue on ppc64 worker - make CPU load alert more strict size:M
  • Description updated (diff)
Actions #10

Updated by okurz 9 months ago

  • Status changed from New to In Progress
  • Assignee set to okurz
Actions #12

Updated by okurz 9 months ago

  • Due date set to 2024-04-22
  • Status changed from In Progress to Feedback
  • Priority changed from High to Low
Actions #13

Updated by okurz 9 months ago

Alerts were triggered, so it works. Bumped the threshold with https://gitlab.suse.de/openqa/salt-states-openqa/-/merge_requests/1143

Actions #14

Updated by okurz 8 months ago

  • Due date deleted (2024-04-22)
  • Status changed from Feedback to Resolved

https://gitlab.suse.de/openqa/salt-states-openqa/-/merge_requests/1143 merged. An alert was received and is valid; mkittler will follow up on it.
