action #164284

closed

[FIRING:1] worker-arm1 (worker-arm1: System load alert openQA worker-arm1 salt system_load_alert_worker-arm1 worker) size:S

Added by livdywan 4 months ago. Updated 4 months ago.

Status: Resolved
Priority: High
Assignee:
Category: Regressions/Crashes
Target version:
Start date:
Due date:
% Done: 0%
Estimated time:

Description

Observation

The load was exceeding our expected limits for 15 minutes; see https://stats.openqa-monitor.qa.suse.de/d/WDworker-arm1/worker-dashboard-worker-arm1?orgId=1

Acceptance Criteria

  • AC1: No alerts about high load for normal openQA workloads on worker-arm1

Suggestions

  • Look for clues about what caused the high load at the time
  • Let's not increase the load limits in Grafana for now
    • Confirm that no jobs were failing or incomplete because of the load
    • Decrease the load limits in the worker, i.e. in workerconf.sls
Actions #1

Updated by okurz 4 months ago

  • Priority changed from Normal to High
  • Target version set to Ready
Actions #2

Updated by livdywan 4 months ago

  • Subject changed from [FIRING:1] worker-arm1 (worker-arm1: System load alert openQA worker-arm1 salt system_load_alert_worker-arm1 worker) to [FIRING:1] worker-arm1 (worker-arm1: System load alert openQA worker-arm1 salt system_load_alert_worker-arm1 worker) size:S
  • Description updated (diff)
  • Status changed from New to Workable
Actions #3

Updated by livdywan 4 months ago

  • Status changed from Workable to In Progress
  • Assignee set to livdywan

I'm taking a look now

Actions #4

Updated by livdywan 4 months ago

  • Description updated (diff)
  • Status changed from In Progress to Feedback
Actions #5

Updated by nicksinger 4 months ago

I wonder if we might need to look closer into what causes this. We have a 48-core CPU here that is not able to handle 10 worker instances, which seems odd to me. But it might be expected if each instance requires a lot of resources (which I haven't checked yet).

Actions #6

Updated by okurz 4 months ago

  • Due date set to 2024-08-12
Actions #7

Updated by livdywan 4 months ago

nicksinger wrote in #note-5:

I wonder if we might need to look closer into what causes this. We have a 48-core CPU here that is not able to handle 10 worker instances, which seems odd to me. But it might be expected if each instance requires a lot of resources (which I haven't checked yet).

CRITICAL_LOAD_AVG_THRESHOLD: 16

I was made aware that we have this option, so I'm setting it instead of reducing the number of workers.
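
For context, a minimal sketch of how such a setting could be carried in the Salt pillar. The nesting shown here is an assumption (this is not the actual MR); only the key name CRITICAL_LOAD_AVG_THRESHOLD and the value 16 come from this comment:

    # Hypothetical excerpt from workerconf.sls (Salt pillar); the layout is assumed.
    workerconf:
      worker-arm1:
        global:
          # Presumably makes the openQA worker stop accepting new jobs while the
          # load average stays above this value, rather than reducing the number
          # of worker instances on the host.
          CRITICAL_LOAD_AVG_THRESHOLD: 16

This would keep all 10 worker instances configured while letting the host back off under high load, in line with not reducing the number of workers.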

Actions #8

Updated by okurz 4 months ago

  • Due date deleted (2024-08-12)
  • Status changed from Feedback to Resolved

The MR is effective. Looking at https://stats.openqa-monitor.qa.suse.de/d/WDworker-arm1/worker-dashboard-worker-arm1?orgId=1&from=1721552583025&to=1722586212383&viewPanel=54694 I see that the load still reaches up to 50 (!). Possibly a load limit of 16 won't be enough for long. If that turns out to be true I suggest reducing the limit further or finding out what makes ARM special. But we already know that ARM machines, in particular those older ones, are special.
