action #162596
open
openQA Project - coordination #112862: [saga][epic] Future ideas for easy multi-machine handling: MM-tests as first-class citizens
openQA Project - coordination #111929: [epic] Stable multi-machine tests covering multiple physical workers
[FIRING:1] worker40 (worker40: partitions usage (%) alert openQA partitions_usage_alert_worker40 worker) auto_review:"No space left on device":retry
Added by okurz 27 days ago.
Updated 2 days ago.
Category: Regressions/Crashes
Description
Observation
With #162374 w40 (worker40.oqa.prg2.suse.org) is the only OSD PRG2 x86_64 tap worker, and due to the size of the openQA job queue w40 is executing openQA jobs near-continuously. An alert has now triggered about too high partition usage. The high partition usage has since subsided. We should investigate what caused the alert and then either prevent false alerts or fix the underlying problem.
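To reproduce the kind of check behind such an alert by hand, a minimal sketch could filter `df -P` output for mount points at or above a usage threshold. The function name `over_threshold` and the default threshold of 90% are assumptions for illustration, not the actual alert rule.

```shell
#!/bin/sh
# Print mount points whose usage is at or above a threshold (percent).
# Reads `df -P` formatted output on stdin so it can be tried offline.
# The default threshold of 90 is an assumption, not the real alert rule.
over_threshold() {
    threshold="${1:-90}"
    awk -v t="$threshold" 'NR > 1 {
        use = $5; sub(/%/, "", use)      # strip the trailing percent sign
        if (use + 0 >= t) print $6, use "%"
    }'
}

# Usage on a live worker (not executed here):
#   df -P /var/lib/openqa | over_threshold 90
```

Reading from stdin rather than calling `df` inside the function keeps the sketch usable against saved output from the time of the alert.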
Rollback steps
- Copied from action #162485: [alert] failed systemd service: openqa-worker-cacheservice on worker40.oqa.prg2.suse.org "Database has been corrupted: DBD::SQLite::db commit failed: disk I/O error" size:S added
- Copied to action #162602: [FIRING:1] worker40 (worker40: CPU load alert openQA worker40 salt cpu_load_alert_worker40 worker) size:S added
- Status changed from New to Blocked
- Assignee set to okurz
- Status changed from Blocked to In Progress
worker40:/var/lib/openqa # df -h /var/lib/openqa/
Filesystem Size Used Avail Use% Mounted on
/dev/md127 470G 433G 14G 98% /var/lib/openqa
worker40:/var/lib/openqa # du -x -d1 -BG | sort -n
1G ./lost+found
59G ./cache
375G ./pool
433G .
worker40:/var/lib/openqa # du -x -d2 -BG | sort -n
…
40G ./pool/11
42G ./pool/21
46G ./pool/19
59G ./cache
59G ./cache/openqa.suse.de
376G ./pool
435G .
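To narrow down such `du` output to the worker instances using the most space, a small filter could keep only the numbered pool directories and print the largest ones. This is a sketch; the helper name `top_pool_dirs` is hypothetical and it assumes the `du -x -d2 -BG` output format shown above.

```shell
#!/bin/sh
# Print the N largest numbered pool directories from `du -BG` output on stdin.
# Assumes lines of the form "<size>G ./pool/<instance>" as in the output above.
top_pool_dirs() {
    n="${1:-3}"
    awk '$2 ~ /^\.\/pool\/[0-9]+$/' | sort -rn | head -n "$n"
}

# Usage on a live worker (not executed here):
#   cd /var/lib/openqa && du -x -d2 -BG | top_pool_dirs 3
```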
pool/19 belongs to the currently running test
https://openqa.suse.de/tests/14690704, which has quite heavy disk settings:
HDDSIZEGB 60
HDDSIZEGB_2 131
Other partitions would have more space available:
# lsblk
NAME MAJ:MIN RM SIZE RO TYPE MOUNTPOINTS
nvme0n1 259:1 0 5.8T 0 disk
├─nvme0n1p1 259:2 0 512M 0 part /boot/efi
├─nvme0n1p2 259:3 0 5.8T 0 part /var
…
│ /
└─nvme0n1p3 259:4 0 1G 0 part [SWAP]
nvme2n1 259:5 0 476.9G 0 disk
└─md127 9:127 0 476.8G 0 raid0 /var/lib/openqa
https://gitlab.suse.de/openqa/salt-pillars-openqa/-/merge_requests/847 reduces the number of worker instances from 49 to 46
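As rough arithmetic on what this reduction buys, the following sketch divides the space left after the cache evenly over the worker instances. It uses the 470G partition size and 59G cache size from the output above. The helper name `per_instance_gib` is hypothetical, and an even split is a simplifying assumption, since real pool directories vary widely in size.

```shell
#!/bin/sh
# Average pool space per worker instance, with one decimal.
# Args: partition size (G), cache size (G), number of worker instances.
# Assumes the space left after the cache is shared evenly, which is a simplification.
per_instance_gib() {
    awk -v size="$1" -v cache="$2" -v n="$3" \
        'BEGIN { printf "%.1f\n", (size - cache) / n }'
}

per_instance_gib 470 59 49   # 49 instances: prints 8.4
per_instance_gib 470 59 46   # 46 instances: prints 8.9
```

The average only rises from about 8.4G to 8.9G per instance, while the largest pool directories listed above held 40G to 46G each, so a few disk-heavy jobs running at the same time can still fill the partition.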
- Due date set to 2024-07-05
- Status changed from In Progress to Feedback
- Due date deleted (2024-07-05)
- Status changed from Feedback to Resolved
- Subject changed from [FIRING:1] worker40 (worker40: partitions usage (%) alert openQA partitions_usage_alert_worker40 worker) to [FIRING:1] worker40 (worker40: partitions usage (%) alert openQA partitions_usage_alert_worker40 worker) auto_review:"No space left on device":retry
- Status changed from Resolved to In Progress
https://gitlab.suse.de/openqa/salt-pillars-openqa/-/merge_requests/851
and triggered openqa-label-known-issues and openqa-advanced-retrigger
export host=openqa.suse.de; failed_since="'2024-06-27'" result="'incomplete'" ./openqa-monitor-investigation-candidates | ./openqa-label-known-issues-multi
and
host=openqa.suse.de failed_since="2024-06-27 07:00" result="result='incomplete'" additional_filters="reason like '%terminated prematurely%'" comment="label:poo#162596" ./openqa-advanced-retrigger-jobs
- Description updated (diff)
- Status changed from In Progress to Feedback
- Status changed from Feedback to Blocked