action #162602

closed

openQA Project (public) - coordination #112862: [saga][epic] Future ideas for easy multi-machine handling: MM-tests as first-class citizens

openQA Project (public) - coordination #111929: [epic] Stable multi-machine tests covering multiple physical workers

[FIRING:1] worker40 (worker40: CPU load alert openQA worker40 salt cpu_load_alert_worker40 worker) size:S

Added by okurz 6 months ago. Updated 5 months ago.

Status: Resolved
Priority: Normal
Assignee:
Category: Regressions/Crashes
Start date: 2024-06-20
Due date:
% Done: 0%
Estimated time:

Description

Observation

With #162374, w40 (worker40.oqa.prg2.suse.org) is the only OSD PRG2 x86_64 tap worker, and because of the size of the openQA job queue it is executing openQA jobs near-continuously. One alert has now fired about excessive CPU load and another about a partition getting full. This is similar to #162596.

Suggestions

  • Maybe the high CPU load was caused by the lack of space, which is tracked in #162596
  • Are tests passing successfully on worker40? If it doesn't look like we have typing or similar issues, bump the alert threshold.
  • Lower the load limit
  • Check the number of worker slots and, e.g., reduce it according to the load. Maybe we didn't notice before that the capacity was already too high.
  • Take #162596 into account
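
The slot-count suggestion above can be illustrated with a rough sketch. It normalises the system load average by the number of CPUs and, if the result exceeds a target, scales a slot count down proportionally. The function names, the target of 1.0 load per CPU, and the slot count of 40 are assumptions made for illustration only; this is not how openQA or the alert itself computes anything.

```python
import os


def load_per_cpu(load_avg: float, cpu_count: int) -> float:
    """Return the load average normalised by the number of CPUs."""
    if cpu_count <= 0:
        raise ValueError("cpu_count must be positive")
    return load_avg / cpu_count


def suggested_slots(current_slots: int, load_avg: float, cpu_count: int,
                    target_per_cpu: float = 1.0) -> int:
    """Scale the slot count down proportionally when load exceeds the target.

    Hypothetical heuristic: never suggests more slots than currently
    configured and never fewer than one.
    """
    ratio = load_per_cpu(load_avg, cpu_count)
    if ratio <= target_per_cpu:
        return current_slots
    return max(1, int(current_slots * target_per_cpu / ratio))


if __name__ == "__main__":
    # 15-minute load average of the local machine (Unix only).
    _, _, load15 = os.getloadavg()
    cpus = os.cpu_count() or 1
    print(f"load15={load15:.2f} cpus={cpus} "
          f"per_cpu={load_per_cpu(load15, cpus):.2f}")
    # 40 is a placeholder slot count, not the real value configured on w40.
    print("suggested slots:", suggested_slots(40, load15, cpus))
```

Run on the worker itself, this prints the normalised load next to a suggested slot count. It only gives a first impression; whether tests actually pass on the host remains the more important signal.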

Rollback actions


Related issues 3 (0 open, 3 closed)

Related to openQA Infrastructure (public) - action #162719: Ensure w40 has more space for worker pool directories size:S (Resolved, dheidler, 2024-06-21)
Copied from openQA Infrastructure (public) - action #162596: [FIRING:1] worker40 (worker40: partitions usage (%) alert openQA partitions_usage_alert_worker40 worker) (Resolved, livdywan)
Copied to openQA Infrastructure (public) - action #162605: [FIRING:1] CPU load alert, should be "system load" (Resolved, okurz, 2024-06-20)