action #162602

open

openQA Project - coordination #112862: [saga][epic] Future ideas for easy multi-machine handling: MM-tests as first-class citizens

openQA Project - coordination #111929: [epic] Stable multi-machine tests covering multiple physical workers

[FIRING:1] worker40 (worker40: CPU load alert openQA worker40 salt cpu_load_alert_worker40 worker) size:S

Added by okurz 10 days ago. Updated 2 days ago.

Status:
In Progress
Priority:
High
Assignee:
Category:
Regressions/Crashes
Target version:
Start date:
2024-06-20
Due date:
2024-07-07 (Due in 7 days)
% Done:

0%

Estimated time:

Description

Observation

Since #162374, w40 (worker40.oqa.prg2.suse.org) is the only OSD PRG2 x86_64 tap worker. Because of the size of the openQA job queue, w40 is executing openQA jobs near-continuously. Two alerts have now fired: one about CPU load being too high and one about a partition getting full. This is similar to #162596.

Suggestions

  • The high CPU load may have been caused by the lack of space, which is tracked in #162596
  • Are tests passing on worker40? If there are no signs of typing or similar issues, raise the alert threshold.
  • Lower the load limit
  • Check the number of worker slots and, if needed, reduce it according to the load. Maybe we did not notice before that the capacity was already too high
  • Take #162596 into account
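
To make the worker-slot suggestion concrete, here is a minimal sketch of how a slot count could be derived from the observed load. It is not the method used in the openQA salt configuration. The function names, the target of 1.0 load per CPU, and the assumption that load grows roughly linearly with the number of busy slots are all hypothetical and chosen for illustration only.

```python
def load_per_cpu(load_avg: float, cpu_count: int) -> float:
    """Normalize a system load average by the number of logical CPUs."""
    if cpu_count <= 0:
        raise ValueError("cpu_count must be positive")
    return load_avg / cpu_count


def suggest_worker_slots(load_avg: float, cpu_count: int, current_slots: int,
                         target_load_per_cpu: float = 1.0) -> int:
    """Suggest a slot count that would bring the normalized load to the target.

    Hypothetical heuristic: assumes load scales linearly with busy slots.
    It never suggests more slots than currently configured and never
    fewer than one.
    """
    if current_slots <= 0:
        raise ValueError("current_slots must be positive")
    normalized = load_per_cpu(load_avg, cpu_count)
    if normalized <= target_load_per_cpu:
        # Load is within the target, so keep the current configuration.
        return current_slots
    scaled = int(current_slots * target_load_per_cpu / normalized)
    return max(1, min(current_slots, scaled))
```

On a Linux worker the inputs could be taken from `os.getloadavg()` (1, 5 and 15 minute averages) and `os.cpu_count()`. Using the 15 minute average would avoid reacting to short spikes, but which average fits the alert is a judgment call.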

Rollback actions


Related issues 2 (2 open, 0 closed)

Copied from openQA Infrastructure - action #162596: [FIRING:1] worker40 (worker40: partitions usage (%) alert openQA partitions_usage_alert_worker40 worker) auto_review:"No space left on device":retry (Status: Feedback, Assignee: okurz)

Copied to openQA Infrastructure - action #162605: [FIRING:1] CPU load alert, should be "system load" (Status: Feedback, Assignee: okurz, Start date: 2024-06-20, Due date: 2024-07-05)
