action #164284

closed

[FIRING:1] worker-arm1 (worker-arm1: System load alert openQA worker-arm1 salt system_load_alert_worker-arm1 worker) size:S

Added by livdywan 4 months ago. Updated 4 months ago.

Status: Resolved
Priority: High
Assignee:
Category: Regressions/Crashes
Target version:
Start date:
Due date:
% Done: 0%
Estimated time:

Description

Observation

The load was exceeding our expected limits for 15 minutes; see https://stats.openqa-monitor.qa.suse.de/d/WDworker-arm1/worker-dashboard-worker-arm1?orgId=1

Acceptance Criteria

  • AC1: No alerts about high load for normal openQA workloads on worker-arm1

Suggestions

  • Look for clues about what caused the high load at the time
  • Let's not increase the load limits in Grafana for now
    • Confirm that no jobs were failing or incomplete because of the load
    • Decrease the load limits in the worker, i.e. in workerconf.sls
Actions #1

Updated by okurz 4 months ago

  • Priority changed from Normal to High
  • Target version set to Ready
Actions #2

Updated by livdywan 4 months ago

  • Subject changed from [FIRING:1] worker-arm1 (worker-arm1: System load alert openQA worker-arm1 salt system_load_alert_worker-arm1 worker) to [FIRING:1] worker-arm1 (worker-arm1: System load alert openQA worker-arm1 salt system_load_alert_worker-arm1 worker) size:S
  • Description updated (diff)
  • Status changed from New to Workable
Actions #3

Updated by livdywan 4 months ago

  • Status changed from Workable to In Progress
  • Assignee set to livdywan

I'm taking a look now

Actions #4

Updated by livdywan 4 months ago

  • Description updated (diff)
  • Status changed from In Progress to Feedback
Actions #5

Updated by nicksinger 4 months ago

I wonder if we might need to look closer into what causes this. We have a 48-core CPU here that is not able to handle 10 worker instances, which seems odd to me. But it might be expected if each instance requires a lot of resources (which I haven't checked yet).

Actions #6

Updated by okurz 4 months ago

  • Due date set to 2024-08-12
Actions #7

Updated by livdywan 4 months ago

nicksinger wrote in #note-5:

I wonder if we might need to look closer into what causes this. We have a 48-core CPU here that is not able to handle 10 worker instances, which seems odd to me. But it might be expected if each instance requires a lot of resources (which I haven't checked yet).

CRITICAL_LOAD_AVG_THRESHOLD: 16

I was made aware that we have this option, so I'm setting it instead of reducing the number of workers.
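
For context, a minimal sketch of how such a setting could be carried in the Salt pillar. The nesting shown here is an assumption (this is not the actual MR); only the key name CRITICAL_LOAD_AVG_THRESHOLD and the value 16 come from this comment:

    # Hypothetical excerpt from workerconf.sls (Salt pillar); the layout is assumed.
    workerconf:
      worker-arm1:
        global:
          # Presumably makes the openQA worker stop accepting new jobs while the
          # load average stays above this value, rather than reducing the number
          # of worker instances on the host.
          CRITICAL_LOAD_AVG_THRESHOLD: 16

This would keep all 10 worker instances configured while letting the host back off under high load, in line with not reducing the number of workers.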

Actions #8

Updated by okurz 4 months ago

  • Due date deleted (2024-08-12)
  • Status changed from Feedback to Resolved

The MR is effective. Looking at https://stats.openqa-monitor.qa.suse.de/d/WDworker-arm1/worker-dashboard-worker-arm1?orgId=1&from=1721552583025&to=1722586212383&viewPanel=54694 I see that the load still reaches up to 50 (!). Possibly a load limit of 16 won't be enough for long. If that turns out to be true I suggest reducing the limit further or finding out what makes ARM special. But we already know that ARM machines, in particular those older ones, are special.
