action #162596: [FIRING:1] worker40 (worker40: partitions usage (%) alert openQA partitions_usage_alert_worker40 worker) - openQA Infrastructure (public) - openSUSE Project Management Tool

action #162596

closed

openQA Project (public) - coordination #112862: [saga][epic] Future ideas for easy multi-machine handling: MM-tests as first-class citizens

openQA Project (public) - coordination #111929: [epic] Stable multi-machine tests covering multiple physical workers

[FIRING:1] worker40 (worker40: partitions usage (%) alert openQA partitions_usage_alert_worker40 worker)

Added by okurz 12 months ago. Updated 9 months ago.

Status: Resolved
Priority: High
Assignee:
Category: Regressions/Crashes
Target version: openQA Project (public) - Ready
Start date:
Due date:
% Done: 0%
Estimated time:
Tags: alert, osd, infra, worker40

Description

Observation

Since #162374, w40 (worker40.oqa.prg2.suse.org) is the only OSD PRG2 x86_64 tap worker, and because of the size of the openQA job queue it is executing openQA jobs near-continuously. An alert has now triggered about too high partition usage. The high partition usage has since subsided. We should investigate what caused the alert and then either prevent false alerts or fix the underlying problem.

Rollback steps

Remove the alert silence rule_uid=partitions_usage_alert_worker40 on https://monitor.qa.suse.de/alerting/silences

Related issues 3 (1 open — 2 closed)

#1

Updated by okurz 12 months ago

Copied from action #162485: [alert] failed systemd service: openqa-worker-cacheservice on worker40.oqa.prg2.suse.org "Database has been corrupted: DBD::SQLite::db commit failed: disk I/O error" size:S added

#2

Updated by okurz 12 months ago

Copied to action #162602: [FIRING:1] worker40 (worker40: CPU load alert openQA worker40 salt cpu_load_alert_worker40 worker) size:S added

#3

Updated by okurz 12 months ago

Status changed from New to Blocked
Assignee set to okurz

#4

Updated by okurz 12 months ago

Status changed from Blocked to In Progress

worker40:/var/lib/openqa # df -h /var/lib/openqa/
Filesystem      Size  Used Avail Use% Mounted on
/dev/md127      470G  433G   14G  98% /var/lib/openqa
worker40:/var/lib/openqa # du -x -d1 -BG | sort -n
1G	./lost+found
59G	./cache
375G	./pool
433G	.
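
The kind of check behind a "partitions usage (%)" alert can be sketched as follows. This is only an illustration: `check_usage` is a hypothetical helper, and the threshold of 90 is an assumed example value, not the one configured in the actual alert rule. The sample input is the df output shown above.

```shell
# Flag filesystems whose Use% is at or above a threshold.
# Reads df-style output on stdin. The default threshold (90) is an
# assumed example value, not the one used by the real alert.
check_usage() {
    threshold="${1:-90}"
    awk -v t="$threshold" 'NR > 1 {
        use = $5
        sub(/%$/, "", use)            # strip the trailing percent sign
        if (use + 0 >= t) print $6 " " use "%"
    }'
}

# Sample input copied from the df output in this note.
sample='Filesystem      Size  Used Avail Use% Mounted on
/dev/md127      470G  433G   14G  98% /var/lib/openqa'

printf '%s\n' "$sample" | check_usage 90
```

On a live host the same function could presumably be fed from `df -P /var/lib/openqa | check_usage 90`, since `-P` keeps each filesystem on a single line.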

#5

Updated by okurz 12 months ago

worker40:/var/lib/openqa # du -x -d2 -BG | sort -n
…
40G	./pool/11
42G	./pool/21
46G	./pool/19
59G	./cache
59G	./cache/openqa.suse.de
376G	./pool
435G	.

pool/19 belongs to the currently running test https://openqa.suse.de/tests/14690704, which has quite heavy disk settings:

HDDSIZEGB 	60
HDDSIZEGB_2 	131

Other partitions would have more space available:

# lsblk 
NAME        MAJ:MIN RM   SIZE RO TYPE  MOUNTPOINTS
nvme0n1     259:1    0   5.8T  0 disk  
├─nvme0n1p1 259:2    0   512M  0 part  /boot/efi
├─nvme0n1p2 259:3    0   5.8T  0 part  /var
…
│                                      /
└─nvme0n1p3 259:4    0     1G  0 part  [SWAP]
nvme2n1     259:5    0 476.9G  0 disk  
└─md127       9:127  0 476.8G  0 raid0 /var/lib/openqa

https://gitlab.suse.de/openqa/salt-pillars-openqa/-/merge_requests/847 reduces the number of worker instances from 49 to 46.
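
As a rough back-of-the-envelope estimate of what the reduction buys, the sketch below divides the space not taken by the cache evenly between instances. `budget_per_instance` is a hypothetical helper, and the even split is a simplifying assumption: real pool directories vary widely, as the du output above shows.

```shell
# Rough per-instance disk budget on /var/lib/openqa in GiB, using the
# partition size (470G) and cache size (59G) from the df/du output in
# this ticket. Assumes space is shared evenly between instances.
budget_per_instance() {
    size_g="$1"; cache_g="$2"; instances="$3"
    LC_ALL=C awk -v s="$size_g" -v c="$cache_g" -v n="$instances" \
        'BEGIN { printf "%.1f\n", (s - c) / n }'
}

budget_per_instance 470 59 49   # before the merge request
budget_per_instance 470 59 46   # after the merge request
```

This prints 8.4 and 8.9, so the change gains only about half a gigabyte per instance on average. A single pool directory such as pool/19 at 46G is far above either figure.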

#6

Updated by okurz 12 months ago

Related to coordination #162716: [epic] Better use of storage on OSD workers added

#7

Updated by okurz 12 months ago

Due date set to 2024-07-05
Status changed from In Progress to Feedback

https://gitlab.suse.de/openqa/salt-pillars-openqa/-/merge_requests/847 merged and applied. Let's see how much that helps.

#8

Updated by okurz 12 months ago

Due date deleted (~~2024-07-05~~)
Status changed from Feedback to Resolved

#9

Updated by okurz 11 months ago

Subject changed from [FIRING:1] worker40 (worker40: partitions usage (%) alert openQA partitions_usage_alert_worker40 worker) to [FIRING:1] worker40 (worker40: partitions usage (%) alert openQA partitions_usage_alert_worker40 worker) auto_review:"No space left on device":retry
Status changed from Resolved to In Progress

This is still happening repeatedly, causing incomplete jobs like https://openqa.suse.de/tests/14737966

#10

Updated by okurz 11 months ago

https://gitlab.suse.de/openqa/salt-pillars-openqa/-/merge_requests/851

and triggered openqa-label-known-issues and openqa-advanced-retrigger

export host=openqa.suse.de; failed_since="'2024-06-27'" result="'incomplete'" ./openqa-monitor-investigation-candidates | ./openqa-label-known-issues-multi

and

host=openqa.suse.de failed_since="2024-06-27 07:00" result="result='incomplete'" additional_filters="reason like '%terminated prematurely%'" comment="label:poo#162596" ./openqa-advanced-retrigger-jobs

#11

Updated by okurz 11 months ago

Description updated (diff)

#12

Updated by okurz 11 months ago

Status changed from In Progress to Feedback

#13

Updated by okurz 11 months ago

https://monitor.qa.suse.de/d/WDworker40/worker-dashboard-worker40?orgId=1&from=1719900063344&to=1719949375574&viewPanel=65090 shows that we still have too high partition usage. Need to block on #162719

#14

Updated by okurz 11 months ago

Status changed from Feedback to Blocked

https://monitor.qa.suse.de/d/WDworker40/worker-dashboard-worker40?orgId=1&from=1719900063344&to=1719949375574&viewPanel=65090 shows that we still have too high partition usage. Need to block on #162719

#15

Updated by livdywan 11 months ago

okurz wrote in #note-14:

https://monitor.qa.suse.de/d/WDworker40/worker-dashboard-worker40?orgId=1&from=1719900063344&to=1719949375574&viewPanel=65090 shows that we still have too high partition usage. Need to block on #162719

Work on the blocker starting now

#16

Updated by livdywan 11 months ago

livdywan wrote in #note-15:

okurz wrote in #note-14:

https://monitor.qa.suse.de/d/WDworker40/worker-dashboard-worker40?orgId=1&from=1719900063344&to=1719949375574&viewPanel=65090 shows that we still have too high partition usage. Need to block on #162719

Work on the blocker starting now

Not just yet. But I bumped the priority to match.

#17

Updated by livdywan 11 months ago

Status changed from Blocked to New

#162719#note-17 was resolved!

#18

Updated by okurz 11 months ago

Status changed from New to Resolved

Partition usage is now below 10%. Silence removed.

#19

Updated by openqa_review 10 months ago

Status changed from Resolved to Feedback

This is an autogenerated message for openQA integration by the openqa_review script:

This bug is still referenced in a failing openQA test: xfstests_btrfs-btrfs-201-999
https://openqa.opensuse.org/tests/4380095#step/btrfs-213/1

To prevent further reminder comments one of the following options should be followed:

The test scenario is fixed by applying the bug fix to the tested product or the test is adjusted
The openQA job group is moved to "Released" or "EOL" (End-of-Life)
The bugref in the openQA scenario is removed or replaced, e.g. label:wontfix:boo1234

Expect the next reminder at the earliest in 28 days if nothing changes in this ticket.

#20

Updated by okurz 10 months ago · Edited

Subject changed from [FIRING:1] worker40 (worker40: partitions usage (%) alert openQA partitions_usage_alert_worker40 worker) auto_review:"No space left on device":retry to [FIRING:1] worker40 (worker40: partitions usage (%) alert openQA partitions_usage_alert_worker40 worker)
Status changed from Feedback to Resolved

The referenced job runs filesystem tests and probably runs out of space on purpose:

ERROR: error during balancing /opt/scratch: No space left on device

I made the mistake of not refining or removing the auto_review expression.
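
The over-match can be reproduced with a minimal sketch. It uses `grep -E` only as a stand-in for the regex matching done by the auto-review tooling, which is an assumption of this example. The pattern is the one from the ticket subject at that time and the log line is the one quoted above.

```shell
# The auto_review expression from the ticket subject at that time.
pattern='No space left on device'

# Log line quoted above from the xfstests job, where running out of
# space is probably intentional.
line='ERROR: error during balancing /opt/scratch: No space left on device'

# grep -E stands in for the real matching (assumption of this sketch).
if printf '%s\n' "$line" | grep -qE "$pattern"; then
    verdict='matched'
else
    verdict='not matched'
fi
echo "$verdict"
```

Because the expression is a plain substring of the message, any job that logs this error gets labeled with this ticket, whether or not the worker partition was involved. Narrowing it would need a string that only occurs in the worker-side failure, which this ticket does not record.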

#21

Updated by openqa_review 10 months ago

Status changed from Resolved to Feedback

This is an autogenerated message for openQA integration by the openqa_review script:

This bug is still referenced in a failing openQA test: slmicro-xfstests_btrfs-generic-601-999
https://openqa.suse.de/tests/15224463#step/generic-746/1

#22

Updated by okurz 10 months ago

Status changed from Feedback to New
Assignee deleted (~~okurz~~)

#23

Updated by livdywan 10 months ago

Status changed from New to Resolved
Assignee set to livdywan

I checked the job: it is not the original issue but was erroneously linked, like the previous job mentioned in #162596#note-19, hence resolving again.

#24

Updated by openqa_review 9 months ago

Status changed from Resolved to Feedback

This is an autogenerated message for openQA integration by the openqa_review script:

This bug is still referenced in a failing openQA test: xfstests_btrfs-btrfs-201-999
https://openqa.opensuse.org/tests/4452821#step/btrfs-213/1

#25

Updated by livdywan 9 months ago

Status changed from Feedback to Resolved

This bug is still referenced in a failing openQA test: xfstests_btrfs-btrfs-201-999
https://openqa.opensuse.org/tests/4452821#step/btrfs-213/1

This is supposedly a carryover from https://openqa.opensuse.org/tests/4303561, which was already deleted. Not sure how this is supposed to work. Either way, it doesn't look related.
