action #162596

open

openQA Project - coordination #112862: [saga][epic] Future ideas for easy multi-machine handling: MM-tests as first-class citizens

openQA Project - coordination #111929: [epic] Stable multi-machine tests covering multiple physical workers

[FIRING:1] worker40 (worker40: partitions usage (%) alert openQA partitions_usage_alert_worker40 worker) auto_review:"No space left on device":retry

Added by okurz 27 days ago. Updated 2 days ago.

Status: Blocked
Priority: High
Assignee:
Category: Regressions/Crashes
Target version:
Start date:
Due date:
% Done: 0%
Estimated time:
Description

Observation

With #162374, w40 (worker40.oqa.prg2.suse.org) is the only OSD PRG2 x86_64 tap worker, and due to the size of the openQA job queue w40 is executing openQA jobs near-continuously. An alert has now triggered about too high partition usage. The high partition usage has since subsided. We should investigate what caused the alert and determine whether it was a false alert, which should be prevented in the future, or whether it points to a problem that still needs fixing.

Rollback steps


Related issues 3 (2 open, 1 closed)

Related to openQA Infrastructure - coordination #162716: [epic] Better use of storage on OSD workers | New | 2024-06-21

Copied from openQA Infrastructure - action #162485: [alert] failed systemd service: openqa-worker-cacheservice on worker40.oqa.prg2.suse.org "Database has been corrupted: DBD::SQLite::db commit failed: disk I/O error" size:S | Resolved | okurz | 2024-06-19

Copied to openQA Infrastructure - action #162602: [FIRING:1] worker40 (worker40: CPU load alert openQA worker40 salt cpu_load_alert_worker40 worker) size:S | Blocked | okurz | 2024-06-20
