action #160598

closed

[alert] s390zl12: CPU load alert openQA s390zl12 salt cpu_load_alert_s390zl12 worker size:S

Added by jbaier_cz 7 months ago. Updated 7 months ago.

Status: Resolved
Priority: Normal
Assignee:
Category: -
Start date:
Due date:
% Done: 0%
Estimated time:

Description

Observation

Could this be related to #158170? Did we allow too many instances?

Summary
System Load too high for a longer time, see https://progress.opensuse.org/issues/150983
Description
System load has been too high for a longer time, so the machine is possibly overloaded. In particular, when too many openQA worker instances are configured, openQA tests can become flaky and show lost or repeated characters in VNC typing.

Check which processes keep the machine busy, look for openQA tests that fail because of this situation and handle them accordingly, e.g. retrigger the openQA tests after mitigating the root cause.

See https://progress.opensuse.org/issues/150983 for details.
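
The "which processes keep the machine busy" check could be sketched as follows. This is only an illustration, not the tooling used for this alert: it assumes a Linux worker with procps `ps` available, and the helper names `parse_ps` and `busiest_processes` are hypothetical.

```python
import subprocess


def parse_ps(ps_output, limit=5):
    """Return the `limit` busiest processes as (pid, cpu_percent, command) tuples.

    Expects the text printed by `ps -eo pid,pcpu,comm` (header line first).
    """
    rows = []
    for line in ps_output.strip().splitlines()[1:]:  # skip the header line
        parts = line.split(None, 2)
        if len(parts) != 3:
            continue  # ignore lines that do not have all three columns
        pid, pcpu, comm = parts
        try:
            rows.append((int(pid), float(pcpu), comm.strip()))
        except ValueError:
            continue  # ignore lines with non-numeric pid or %CPU
    rows.sort(key=lambda row: row[1], reverse=True)
    return rows[:limit]


def busiest_processes(limit=5):
    """Run `ps` on the local machine and return its busiest processes.

    Note: the %CPU column of `ps` is averaged over each process lifetime,
    so this is a rough first look, not an instantaneous measurement.
    """
    result = subprocess.run(
        ["ps", "-eo", "pid,pcpu,comm"],
        capture_output=True, text=True, check=True,
    )
    return parse_ps(result.stdout, limit)


if __name__ == "__main__":
    for pid, cpu, comm in busiest_processes():
        print(f"{pid:>8} {cpu:6.1f}% {comm}")
```

Run on the affected worker while the alert is firing, this gives a first hint whether the worker instances themselves or some unrelated process cause the load.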
Values
B=79.57516129032257  C=1 
Labels
alertname         s390zl12: CPU load alert
grafana_folder         openQA
host         s390zl12
hostname         s390zl12
origin         salt
rule_uid         cpu_load_alert_s390zl12
type         worker

https://stats.openqa-monitor.qa.suse.de/d/WDs390zl12/worker-dashboard-s390zl12?orgId=1&from=1715990201255&to=1716033674370

The alert is resolved at the moment, so no rollback steps are needed and the priority is normal for now.

Suggestions

  • Consider reducing the worker slots
  • Check that the alert threshold is good, or adjust it
  • Take a look at the logs from the timeframe of the alert firing
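
To judge whether a threshold is reasonable, it can help to relate the load average to the number of CPUs. The following is a minimal sketch under stated assumptions: the ticket states neither the configured alert threshold nor the CPU count of s390zl12, so `per_cpu_threshold=1.5` and the helper names are purely hypothetical placeholders.

```python
import os


def load_per_cpu(load_avg, cpu_count):
    """Normalise a load average by the number of CPUs."""
    if cpu_count < 1:
        raise ValueError("cpu_count must be at least 1")
    return load_avg / cpu_count


def is_overloaded(load_avg, cpu_count, per_cpu_threshold=1.5):
    """Return True if the load per CPU exceeds the threshold.

    per_cpu_threshold is a hypothetical value, not the one used by the alert.
    """
    return load_per_cpu(load_avg, cpu_count) > per_cpu_threshold


if __name__ == "__main__":
    # os.getloadavg() returns the 1, 5 and 15 minute load averages (Unix only).
    one, five, fifteen = os.getloadavg()
    cpus = os.cpu_count() or 1
    print(f"load15={fifteen:.2f} cpus={cpus} "
          f"per_cpu={load_per_cpu(fifteen, cpus):.2f} "
          f"overloaded={is_overloaded(fifteen, cpus)}")
```

Comparing the per-CPU figure before and after reducing the worker slots shows whether the reduction actually relieves the machine.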

Related issues 3 (0 open, 3 closed)

Related to openQA Infrastructure (public) - action #158170: Increase resources for s390x kvm size:M (Resolved, nicksinger, 2024-03-27)

Has duplicate openQA Infrastructure (public) - action #160730: [FIRING:1] s390zl12 (s390zl12: CPU load alert openQA s390zl12 salt cpu_load_alert_s390zl12 worker) (Rejected, 2024-05-08)

Copied from openQA Infrastructure (public) - action #153958: [alert] s390zl12: Memory usage alert Generic memory_usage_alert_s390zl12 generic (Resolved, okurz, 2024-01-19)
