action #162602
closed
openQA Project (public) - coordination #112862: [saga][epic] Future ideas for easy multi-machine handling: MM-tests as first-class citizens
openQA Project (public) - coordination #111929: [epic] Stable multi-machine tests covering multiple physical workers
[FIRING:1] worker40 (worker40: CPU load alert openQA worker40 salt cpu_load_alert_worker40 worker) size:S
Added by okurz 6 months ago.
Updated 5 months ago.
Category:
Regressions/Crashes
Description
Observation
With #162374 w40 (worker40.oqa.prg2.suse.org) is the only OSD PRG2 x86_64 tap worker, and due to the openQA job queue size w40 is executing openQA jobs near-continuously. Now one alert fired about too high CPU load and another about a partition filling up. Similar to #162596
Suggestions
- Maybe the high CPU load was caused by the lack of space - which is tracked in #162596
- Check whether tests are passing successfully on worker40; if there are no typing or similar issues, consider bumping the alert threshold.
- Lower the load limit
- Check the number of worker slots and e.g. reduce according to the load - maybe we didn't notice the capacity was already too high before
- Take #162596 into account
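The suggestion to reduce worker slots according to the load can be sketched as a rule of thumb. This is a hypothetical helper, not part of the OSD tooling; the `per_job_load` factor and the example numbers are assumptions for illustration only.

```python
# Hypothetical rule-of-thumb sketch (not the OSD salt/openQA tooling):
# estimate how many openQA worker slots a machine can sustain from its CPU
# count and a target load-average ceiling, assuming each running job
# contributes roughly one runnable qemu process plus some overhead.

def recommended_slots(cpu_count: int, load_limit: float,
                      per_job_load: float = 1.2) -> int:
    """Return a slot count that keeps the expected load under the alert limit."""
    if per_job_load <= 0:
        raise ValueError("per_job_load must be positive")
    slots = int(load_limit / per_job_load)
    # Never exceed the physical CPU count and always allow at least one slot.
    return max(1, min(slots, cpu_count))

print(recommended_slots(cpu_count=48, load_limit=25.0))  # -> 20 with the defaults
```

With the assumed per-job load factor, a limit of 25 suggests roughly 20 slots on a 48-CPU machine; the real decision would of course be driven by the observed load on worker40.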
Rollback actions
- Copied from action #162596: [FIRING:1] worker40 (worker40: partitions usage (%) alert openQA partitions_usage_alert_worker40 worker) added
- Copied to action #162605: [FIRING:1] CPU load alert, should be "system load" added
- Subject changed from [FIRING:1] worker40 (worker40: CPU load alert openQA worker40 salt cpu_load_alert_worker40 worker) to [FIRING:1] worker40 (worker40: CPU load alert openQA worker40 salt cpu_load_alert_worker40 worker) size:S
- Description updated (diff)
- Status changed from New to Workable
- Status changed from Workable to In Progress
- Assignee set to okurz
- Description updated (diff)
- Status changed from In Progress to Resolved
The load limit is effective and the rollback action is done. No alert is firing right now; we will be notified if the alert triggers again.
- Status changed from Resolved to In Progress
Apparently that was not enough; I need to silence the alerts and decide what to do about them.
- Due date set to 2024-07-07
Setting due date based on mean cycle time of SUSE QE Tools
- Priority changed from Urgent to High
Silenced alert again for now.
I looked at https://monitor.qa.suse.de/d/WDworker40/worker-dashboard-worker40?orgId=1&from=now-7d&to=now
and found short-lived load spikes coinciding with significant but non-critical memory usage, low CPU usage, and high I/O usage with maxed-out I/O times. As observed over the past days, during those periods there are particularly demanding openQA jobs with bigger HDDSIZE requests and correspondingly high I/O demands. I assume that in such cases we are not actually hitting typing issues, but we still stall test execution and trigger the observed alerts.
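The triage reasoning above can be made explicit: a high load average combined with low CPU utilisation but high iowait points at I/O saturation rather than CPU starvation. The function below is a hypothetical sketch with made-up thresholds, not part of the openQA or Grafana stack.

```python
# Sketch of the load-spike triage described above (hypothetical helper,
# illustrative thresholds): all inputs are normalised 0..1 figures, with
# load_per_cpu being the load average divided by the CPU count.

def classify_load_spike(load_per_cpu: float, cpu_util: float,
                        iowait: float) -> str:
    """Classify a load spike as cpu-bound, io-bound, or unclear."""
    if load_per_cpu < 1.0:
        return "no spike"
    if cpu_util > 0.8:
        return "cpu-bound"
    if iowait > 0.5:
        # Matches the worker40 observation: load up, CPU mostly idle, I/O maxed out.
        return "io-bound"
    return "unclear"

print(classify_load_spike(load_per_cpu=1.6, cpu_util=0.2, iowait=0.9))  # -> io-bound
```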
So I proposed https://gitlab.suse.de/openqa/salt-pillars-openqa/-/merge_requests/848 to reduce the worker load limit from 30 to 25.
Next idea: Limit "qemu_x86_64-large-mem" jobs to a small number of worker instances, or introduce a new worker class and ask the SAP-HA squad to schedule their tests against it specifically.
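Confining a class to a few slots could look like the following sketch, assuming the standard per-instance sections of openQA's `/etc/openqa/workers.ini`; the slot numbers and the accompanying classes are illustrative, only "qemu_x86_64-large-mem" is taken from the note above.

```ini
; /etc/openqa/workers.ini (sketch): keep the default classes on most slots and
; offer the large-mem class on only two instances, so high-I/O jobs cannot
; saturate the whole machine. Slot numbers and extra classes are illustrative.
[global]
WORKER_CLASS = qemu_x86_64,tap

[1]
WORKER_CLASS = qemu_x86_64,tap,qemu_x86_64-large-mem

[2]
WORKER_CLASS = qemu_x86_64,tap,qemu_x86_64-large-mem
```

Per-instance sections override the global `WORKER_CLASS`, so only slots 1 and 2 would pick up jobs scheduled against the large-mem class.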
- Status changed from In Progress to Feedback
- Related to action #162719: Ensure w40 has more space for worker pool directories size:S added
- Due date deleted (2024-07-07)
- Status changed from Feedback to Blocked
- Priority changed from High to Normal
The proposal for "size" worker classes stands and I would like to give it more time to decide whether that is the right approach. There is also the impact of space depletion, which coincided with the too-high load, so blocking on #162719
- Status changed from Blocked to Resolved