action #164284
closed
[FIRING:1] worker-arm1 (worker-arm1: System load alert openQA worker-arm1 salt system_load_alert_worker-arm1 worker) size:S
0%
Description
Observation
The load exceeded our expected limits for 15 minutes, see https://stats.openqa-monitor.qa.suse.de/d/WDworker-arm1/worker-dashboard-worker-arm1?orgId=1
Acceptance Criteria
- AC1: No alerts about high load for normal openQA workloads on worker-arm1
Suggestions
- Look for cues on what caused the high load at the time
- Let's not increase the load limits in grafana for now
- Confirm that no jobs were failing or incomplete because of the load
- Decrease the load limits for the worker, i.e. in workerconf.sls
Updated by livdywan 9 months ago
- Subject changed from [FIRING:1] worker-arm1 (worker-arm1: System load alert openQA worker-arm1 salt system_load_alert_worker-arm1 worker) to [FIRING:1] worker-arm1 (worker-arm1: System load alert openQA worker-arm1 salt system_load_alert_worker-arm1 worker) size:S
- Description updated (diff)
- Status changed from New to Workable
Updated by nicksinger 8 months ago
I wonder if we might need to look closer into what causes this. We're having a 48 core CPU here not being able to handle 10 worker-instances which seems odd to me. But it might be expected if each instance requires a lot of resources (which I haven't checked yet)
Updated by livdywan 8 months ago
nicksinger wrote in #note-5:
I wonder if we might need to look closer into what causes this. We're having a 48 core CPU here not being able to handle 10 worker-instances which seems odd to me. But it might be expected if each instance requires a lot of resources (which I haven't checked yet)
CRITICAL_LOAD_AVG_THRESHOLD: 16
I was made aware that we have this feature, hence changing to this instead of reducing the number of workers.
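As an illustration of what such a threshold does conceptually, here is a minimal Python sketch of a load gate: a worker would skip picking up new jobs while the system load is above the limit. This is not openQA's actual implementation. The function name `is_overloaded` and the choice to compare only the 1-minute load average are assumptions made for this example; only the value 16 comes from this ticket.

```python
import os

# Value taken from this ticket; everything else here is illustrative.
CRITICAL_LOAD_AVG_THRESHOLD = 16


def is_overloaded(load_averages, threshold=CRITICAL_LOAD_AVG_THRESHOLD):
    """Return True if the 1-minute load average is above the threshold.

    load_averages is a (1min, 5min, 15min) tuple as returned by
    os.getloadavg(). Using only the 1-minute value is an assumption.
    """
    one_minute = load_averages[0]
    return one_minute > threshold


if __name__ == "__main__":
    loads = os.getloadavg()  # available on Unix-like systems
    if is_overloaded(loads):
        print(f"load {loads[0]:.2f} above {CRITICAL_LOAD_AVG_THRESHOLD}, not accepting new jobs")
    else:
        print(f"load {loads[0]:.2f} within limit, accepting new jobs")
```

The point of gating on load rather than reducing the number of worker instances is that all instances stay available when the machine is idle, and job pickup is throttled only while the host is busy.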
Updated by okurz 8 months ago
- Due date deleted (2024-08-12)
- Status changed from Feedback to Resolved
MR is effective. Looking at https://stats.openqa-monitor.qa.suse.de/d/WDworker-arm1/worker-dashboard-worker-arm1?orgId=1&from=1721552583025&to=1722586212383&viewPanel=54694 I see that the load still reaches up to 50 (!). Possibly a load limit of 16 won't be enough for long. If that turns out to be true I suggest reducing the limit further or learning what makes ARM special. But we already know that ARM machines, in particular those older ones, are special.
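To put the numbers from this ticket into proportion, the following small Python sketch normalises the load values by the 48 cores mentioned earlier in the thread. The helper name `load_per_core` is made up for this example, and dividing load by core count is only a rough heuristic because the load average also counts tasks waiting on I/O.

```python
def load_per_core(load, cores):
    """Return the load average divided by the number of CPU cores."""
    return load / cores


CORES = 48  # core count mentioned earlier in this ticket

# Observed peak load of 50 versus the configured limit of 16.
print(round(load_per_core(50, CORES), 2))  # 1.04
print(round(load_per_core(16, CORES), 2))  # 0.33
```

Under that heuristic the observed peak of 50 is roughly one runnable task per core, while the limit of 16 corresponds to about a third of a task per core.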
Updated by nicksinger 15 days ago
- Copied to action #179497: [FIRING:1] worker-arm1 (worker-arm1: System load alert openQA worker-arm1 salt system_load_alert_worker-arm1 worker) added