action #160598
closed
[alert] s390zl12: CPU load alert openQA s390zl12 salt cpu_load_alert_s390zl12 worker size:S
0%
Description
Observation
Could be related to #158170? Did we allow too many instances?
Summary
System Load too high for a longer time, see https://progress.opensuse.org/issues/150983
Description
System load is considered too high for a longer time. The machine is possibly overloaded. Especially when too many openQA worker instances are configured, openQA tests become flaky and show lost or repeated characters in VNC typing.
Take a look at which processes keep the machine busy, look for corresponding openQA tests failing due to this situation, and handle accordingly, e.g. retrigger the openQA tests after mitigating the root cause.
See https://progress.opensuse.org/issues/150983 for details.
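To find out which processes keep the machine busy, something like the following could be run on the worker. This is only a minimal sketch: the function names are made up for illustration, and it assumes a procps-style `ps` that accepts `-eo pid,pcpu,comm`. Note that `%CPU` from `ps` is averaged over the lifetime of each process, so it is a rough indicator rather than a live reading.

```python
import subprocess


def parse_ps_output(ps_output, limit=5):
    """Return the `limit` busiest processes as (pid, cpu_percent, command) tuples.

    Expects the text produced by `ps -eo pid,pcpu,comm`, including its header line.
    """
    rows = []
    for line in ps_output.strip().splitlines()[1:]:  # skip the header line
        parts = line.split(None, 2)
        if len(parts) < 3:
            continue  # ignore malformed lines
        pid, pcpu, command = parts
        rows.append((int(pid), float(pcpu), command.strip()))
    # Highest CPU share first
    rows.sort(key=lambda row: row[1], reverse=True)
    return rows[:limit]


def busiest_processes(limit=5):
    """Hypothetical helper: run `ps` locally and return the top CPU consumers."""
    result = subprocess.run(
        ["ps", "-eo", "pid,pcpu,comm"],
        capture_output=True,
        text=True,
        check=True,
    )
    return parse_ps_output(result.stdout, limit)
```

Calling `busiest_processes()` on the affected host returns the top entries, which can then be matched against the openQA jobs running in the alert timeframe.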
Values
B=79.57516129032257 C=1
Labels
alertname s390zl12: CPU load alert
grafana_folder openQA
host s390zl12
hostname s390zl12
origin salt
rule_uid cpu_load_alert_s390zl12
type worker
The issue is resolved at the moment, so no rollback steps are needed and normal priority is fine for now.
Suggestions
- Consider reducing the worker slots
- Check that the alert threshold is good, or adjust it
- Take a look at the logs from the timeframe of the alert firing
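One way to sanity-check the alert threshold is to relate the reported load to the number of CPUs on the machine. The sketch below is an illustration only: it assumes that `B` in the alert values is a load average, and both the core counts and the per-core threshold used in the usage note are hypothetical. The real alert rule lives in the monitoring setup and is not reproduced here.

```python
import os


def load_per_core(load_average, cpu_count):
    """Normalize a load average by the number of CPUs."""
    if cpu_count <= 0:
        raise ValueError("cpu_count must be positive")
    return load_average / cpu_count


def exceeds_threshold(load_average, cpu_count, per_core_threshold):
    """Return True if the normalized load is above the given per-core threshold."""
    return load_per_core(load_average, cpu_count) > per_core_threshold


def current_load_per_core():
    """Hypothetical helper: 15-minute load average of this host, per CPU."""
    _one_min, _five_min, fifteen_min = os.getloadavg()
    return load_per_core(fifteen_min, os.cpu_count() or 1)
```

For example, `exceeds_threshold(79.57516129032257, 16, 2.0)` is true on an assumed 16-CPU machine, while the same load on an assumed 64-CPU machine stays below that assumed threshold. Whether either assumption matches s390zl12 would need to be checked on the host.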
Updated by jbaier_cz 7 months ago
- Copied from action #153958: [alert] s390zl12: Memory usage alert Generic memory_usage_alert_s390zl12 generic added
Updated by jbaier_cz 7 months ago
- Related to action #158170: Increase resources for s390x kvm size:M added
Updated by jbaier_cz 7 months ago
- Has duplicate action #160730: [FIRING:1] s390zl12 (s390zl12: CPU load alert openQA s390zl12 salt cpu_load_alert_s390zl12 worker) added
Updated by jbaier_cz 7 months ago
There are no suspicious messages in the log around the problematic times. It seems that there are just too many jobs to be done at the same time. Let's try to disable a few worker slots and reiterate: https://gitlab.suse.de/openqa/salt-pillars-openqa/-/merge_requests/817
Updated by openqa_review 7 months ago
- Due date set to 2024-06-07
Setting due date based on mean cycle time of SUSE QE Tools
Updated by livdywan 7 months ago
- Due date deleted (2024-06-07)
I wonder what happened here. @jbaier_cz Did you make any progress? Maybe worth discussing in the unblock if there are open questions here.
Updated by jbaier_cz 7 months ago
I believe https://gitlab.suse.de/openqa/salt-pillars-openqa/-/merge_requests/817 is still not merged. Last time I looked, there were some review points, which I have already addressed in an updated commit. So I am blocked here, waiting for a review / merge.