action #160598

closed

[alert] s390zl12: CPU load alert openQA s390zl12 salt cpu_load_alert_s390zl12 worker size:S

Added by jbaier_cz 6 months ago. Updated 5 months ago.

Status: Resolved
Priority: Normal
Assignee:
Category: -
Target version:
Start date:
Due date:
% Done: 0%
Estimated time:
Description

Observation

Could this be related to #158170? Did we allow too many instances?

Summary
System Load too high for a longer time, see https://progress.opensuse.org/issues/150983
Description
System load has been too high for an extended period, so the machine is possibly overloaded. In particular, when too many openQA worker instances are configured, openQA tests can become flaky and show lost or repeated characters in VNC typing.

Check which processes are keeping the machine busy, look for openQA tests that failed because of this situation, and handle them accordingly, e.g. retrigger the openQA tests after mitigating the root cause.

See https://progress.opensuse.org/issues/150983 for details.
Values
B=79.57516129032257  C=1 
Labels
alertname         s390zl12: CPU load alert
grafana_folder         openQA
host         s390zl12
hostname         s390zl12
origin         salt
rule_uid         cpu_load_alert_s390zl12
type         worker

https://stats.openqa-monitor.qa.suse.de/d/WDs390zl12/worker-dashboard-s390zl12?orgId=1&from=1715990201255&to=1716033674370

The alert is not firing at the moment, so no rollback steps are needed and normal priority is fine for now.
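
To find the matching log timeframe, the `from` and `to` parameters of the dashboard link above can be decoded. This is a minimal sketch, assuming they are Unix epoch timestamps in milliseconds (as Grafana uses for absolute time ranges); the helper name `grafana_window` is made up for illustration.

```python
from datetime import datetime, timezone
from urllib.parse import parse_qs, urlparse

# Dashboard link copied from the ticket description.
DASHBOARD_URL = (
    "https://stats.openqa-monitor.qa.suse.de/d/WDs390zl12/"
    "worker-dashboard-s390zl12?orgId=1&from=1715990201255&to=1716033674370"
)


def grafana_window(url):
    """Return (start, end) of the dashboard time range as UTC datetimes.

    Assumption: 'from' and 'to' are epoch milliseconds.
    """
    query = parse_qs(urlparse(url).query)
    start_ms = int(query["from"][0])
    end_ms = int(query["to"][0])

    def to_utc(ms):
        return datetime.fromtimestamp(ms / 1000, tz=timezone.utc)

    return to_utc(start_ms), to_utc(end_ms)


start, end = grafana_window(DASHBOARD_URL)
print("from:", start.isoformat())
print("to:  ", end.isoformat())
print("span:", end - start)
```

The resulting UTC window can then be passed to whatever log viewer is used on the worker, for example as the start and end of a `journalctl` time filter.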

Suggestions

  • Consider reducing the worker slots
  • Check that the alert threshold is good, or adjust it
  • Take a look at the logs from the timeframe of the alert firing
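
A rough way to sanity-check the threshold and the number of worker slots is to relate the load average to the number of CPUs. The sketch below is only an illustration: it assumes the alert value `B` is a load average, the CPU count in the usage note is a placeholder rather than the real size of s390zl12, and the per-CPU limit of 1.0 is a hypothetical rule of thumb, not the configured alert threshold.

```python
import os


def load_per_cpu(load_avg, cpu_count):
    """Normalize a load average by the number of CPUs."""
    if cpu_count < 1:
        raise ValueError("cpu_count must be at least 1")
    return load_avg / cpu_count


def is_overloaded(load_avg, cpu_count, max_load_per_cpu=1.0):
    """True if the normalized load exceeds the (hypothetical) per-CPU limit."""
    return load_per_cpu(load_avg, cpu_count) > max_load_per_cpu


if __name__ == "__main__":
    # On the worker itself: compare the 15-minute load with the CPU count.
    _one, _five, fifteen = os.getloadavg()
    cpus = os.cpu_count() or 1
    print(f"load15={fifteen:.2f} cpus={cpus} "
          f"per_cpu={load_per_cpu(fifteen, cpus):.2f} "
          f"overloaded={is_overloaded(fifteen, cpus)}")
```

For example, `is_overloaded(79.57516129032257, 16)` evaluates the value from the alert against a placeholder 16-CPU machine; the real CPU count has to be read from the host.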

Related issues 3 (0 open, 3 closed)

Related to openQA Infrastructure - action #158170: Increase resources for s390x kvm size:M (Resolved, nicksinger, 2024-03-27)
Has duplicate openQA Infrastructure - action #160730: [FIRING:1] s390zl12 (s390zl12: CPU load alert openQA s390zl12 salt cpu_load_alert_s390zl12 worker) (Rejected, 2024-05-08)
Copied from openQA Infrastructure - action #153958: [alert] s390zl12: Memory usage alert Generic memory_usage_alert_s390zl12 generic (Resolved, okurz, 2024-01-19)

Actions #1

Updated by jbaier_cz 6 months ago

  • Copied from action #153958: [alert] s390zl12: Memory usage alert Generic memory_usage_alert_s390zl12 generic added
Actions #2

Updated by jbaier_cz 6 months ago

  • Related to action #158170: Increase resources for s390x kvm size:M added
Actions #3

Updated by jbaier_cz 6 months ago

  • Has duplicate action #160730: [FIRING:1] s390zl12 (s390zl12: CPU load alert openQA s390zl12 salt cpu_load_alert_s390zl12 worker) added
Actions #4

Updated by livdywan 6 months ago

  • Subject changed from [alert] s390zl12: CPU load alert openQA s390zl12 salt cpu_load_alert_s390zl12 worker to [alert] s390zl12: CPU load alert openQA s390zl12 salt cpu_load_alert_s390zl12 worker size:S
  • Description updated (diff)
  • Status changed from New to Workable
Actions #5

Updated by jbaier_cz 6 months ago

  • Assignee set to jbaier_cz
Actions #6

Updated by jbaier_cz 6 months ago

There are no suspicious messages in the log around the problematic times. It seems that there are simply too many jobs running at the same time. Let's try disabling a few worker slots and then re-evaluate: https://gitlab.suse.de/openqa/salt-pillars-openqa/-/merge_requests/817

Actions #7

Updated by jbaier_cz 6 months ago

  • Status changed from Workable to In Progress
Actions #8

Updated by openqa_review 6 months ago

  • Due date set to 2024-06-07

Setting due date based on mean cycle time of SUSE QE Tools

Actions #9

Updated by jbaier_cz 6 months ago

  • Status changed from In Progress to Workable
Actions #10

Updated by livdywan 6 months ago

  • Due date deleted (2024-06-07)

I wonder what happened here. @jbaier_cz Did you make any progress? Maybe worth discussing in the unblock if there are open questions here.

Actions #11

Updated by jbaier_cz 6 months ago

I believe https://gitlab.suse.de/openqa/salt-pillars-openqa/-/merge_requests/817 is still not merged. The last time I looked, there were some review points, which I have already addressed in an updated commit. So I am blocked here, waiting for a review / merge.

Actions #12

Updated by jbaier_cz 5 months ago

  • Status changed from Workable to Resolved

No more alerts after the merge so far. Let's hope this is enough, and reopen if not (we can still disable more slots if needed).
