action #164284
closed
[FIRING:1] worker-arm1 (worker-arm1: System load alert openQA worker-arm1 salt system_load_alert_worker-arm1 worker) size:S
0%
Description
Observation
The load exceeded our expected limits for 15 minutes, see https://stats.openqa-monitor.qa.suse.de/d/WDworker-arm1/worker-dashboard-worker-arm1?orgId=1
Acceptance Criteria
- AC1: No alerts about high load for normal openQA workloads on worker-arm1
Suggestions
- Look for cues on what caused the high load at the time
- Let's not increase the load limits in grafana for now
- Confirm that no jobs were failing or incomplete because of the load
- Decrease the load limits in the worker configuration, i.e. in workerconf.sls
Updated by livdywan 4 months ago
- Subject changed from [FIRING:1] worker-arm1 (worker-arm1: System load alert openQA worker-arm1 salt system_load_alert_worker-arm1 worker) to [FIRING:1] worker-arm1 (worker-arm1: System load alert openQA worker-arm1 salt system_load_alert_worker-arm1 worker) size:S
- Description updated (diff)
- Status changed from New to Workable
Updated by nicksinger 4 months ago
I wonder if we need to look closer into what causes this. We have a 48-core CPU here that cannot handle 10 worker instances, which seems odd to me. But it might be expected if each instance requires a lot of resources (which I haven't checked yet).
Updated by livdywan 4 months ago
nicksinger wrote in #note-5:
I wonder if we need to look closer into what causes this. We have a 48-core CPU here that cannot handle 10 worker instances, which seems odd to me. But it might be expected if each instance requires a lot of resources (which I haven't checked yet).
CRITICAL_LOAD_AVG_THRESHOLD: 16
I was made aware that we have this feature, so I am changing this setting instead of reducing the number of workers.
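For readers unfamiliar with a load-average limit, the idea can be sketched roughly as follows. This is only an illustration under stated assumptions, not how the openQA worker actually implements it: the function name `may_accept_new_job` is hypothetical, and comparing against the 15-minute load average is an assumed policy. Only the setting name and the value 16 come from this ticket.

```python
import os

# Value from this ticket; the constant mirrors the name of the worker setting.
CRITICAL_LOAD_AVG_THRESHOLD = 16.0


def may_accept_new_job(load_avg_15min, threshold=CRITICAL_LOAD_AVG_THRESHOLD):
    """Hypothetical gate: allow new jobs only while the load is at or below the threshold."""
    return load_avg_15min <= threshold


if __name__ == "__main__":
    # os.getloadavg() returns the 1-, 5- and 15-minute load averages (Unix only).
    _, _, load15 = os.getloadavg()
    print(f"15-minute load average: {load15:.2f}")
    print(f"may accept new job: {may_accept_new_job(load15)}")
```

With such a gate the host keeps its configured worker instances, but an overloaded machine stops taking on additional work until the load drops again, which is the trade-off preferred here over reducing the number of workers.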
Updated by okurz 4 months ago
- Due date deleted (2024-08-12)
- Status changed from Feedback to Resolved
The MR is effective. Looking at https://stats.openqa-monitor.qa.suse.de/d/WDworker-arm1/worker-dashboard-worker-arm1?orgId=1&from=1721552583025&to=1722586212383&viewPanel=54694 I see that the load still reaches up to 50 (!). A load limit of 16 might not be enough for long. If that turns out to be true, I suggest reducing the limit further or learning what makes ARM special. But we already know that ARM machines, in particular those older ones, are special.