action #162596

open

openQA Project - coordination #112862: [saga][epic] Future ideas for easy multi-machine handling: MM-tests as first-class citizens

openQA Project - coordination #111929: [epic] Stable multi-machine tests covering multiple physical workers

[FIRING:1] worker40 (worker40: partitions usage (%) alert openQA partitions_usage_alert_worker40 worker) auto_review:"No space left on device":retry

Added by okurz 27 days ago. Updated 2 days ago.

Status: Blocked
Priority: High
Assignee:
Category: Regressions/Crashes
Target version:
Start date:
Due date:
% Done: 0%
Estimated time:

Description

Observation

With #162374, w40 (worker40.oqa.prg2.suse.org) is the only OSD PRG2 x86_64 tap worker, and due to the size of the openQA job queue w40 is executing openQA jobs near-continuously. An alert has now triggered about too high partition usage. In the meantime the high partition usage has subsided again. We should investigate what caused the alert and then either prevent false alerts or fix the underlying problem.

Rollback steps


Related issues 3 (2 open, 1 closed)

Related to openQA Infrastructure - coordination #162716: [epic] Better use of storage on OSD workers (New, 2024-06-21)

Copied from openQA Infrastructure - action #162485: [alert] failed systemd service: openqa-worker-cacheservice on worker40.oqa.prg2.suse.org "Database has been corrupted: DBD::SQLite::db commit failed: disk I/O error" size:S (Resolved, okurz, 2024-06-19)

Copied to openQA Infrastructure - action #162602: [FIRING:1] worker40 (worker40: CPU load alert openQA worker40 salt cpu_load_alert_worker40 worker) size:S (Blocked, okurz, 2024-06-20)

Actions #1

Updated by okurz 27 days ago

  • Copied from action #162485: [alert] failed systemd service: openqa-worker-cacheservice on worker40.oqa.prg2.suse.org "Database has been corrupted: DBD::SQLite::db commit failed: disk I/O error" size:S added
Actions #2

Updated by okurz 27 days ago

  • Copied to action #162602: [FIRING:1] worker40 (worker40: CPU load alert openQA worker40 salt cpu_load_alert_worker40 worker) size:S added
Actions #3

Updated by okurz 27 days ago

  • Status changed from New to Blocked
  • Assignee set to okurz
Actions #4

Updated by okurz 26 days ago

  • Status changed from Blocked to In Progress
worker40:/var/lib/openqa # df -h /var/lib/openqa/
Filesystem      Size  Used Avail Use% Mounted on
/dev/md127      470G  433G   14G  98% /var/lib/openqa
worker40:/var/lib/openqa # du -x -d1 -BG | sort -n
1G  ./lost+found
59G ./cache
375G    ./pool
433G    .
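
The check behind such an alert can be approximated by comparing the Use% column of df against a limit. The following is only a minimal sketch: the function name `check_usage` and the threshold of 85 are hypothetical and do not reflect the actual alert rule, which is not shown in this ticket. The sample input is copied from the df output above.

```shell
# Sketch: flag filesystems above a usage threshold from df-style output.
# The threshold of 85 is a hypothetical value, not the configured alert rule.
check_usage() {
    local threshold="$1"
    # Skip the header line; column 5 is Use%, column 6 is the mount point.
    awk -v t="$threshold" 'NR > 1 {
        pct = $5
        sub(/%/, "", pct)
        if (pct + 0 > t) print $6 ": " pct "%"
    }'
}

# Sample input copied from the df output in this note.
check_usage 85 <<'EOF'
Filesystem      Size  Used Avail Use% Mounted on
/dev/md127      470G  433G   14G  98% /var/lib/openqa
EOF
```

On a live host the same function could be fed with `df -P /var/lib/openqa | check_usage 85`.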
Actions #5

Updated by okurz 26 days ago

worker40:/var/lib/openqa # du -x -d2 -BG | sort -n
…
40G ./pool/11
42G ./pool/21
46G ./pool/19
59G ./cache
59G ./cache/openqa.suse.de
376G    ./pool
435G    .

pool/19 belongs to the currently running test https://openqa.suse.de/tests/14690704, which has quite heavy disk settings:

HDDSIZEGB   60
HDDSIZEGB_2     131

Other partitions would have more space available:

# lsblk 
NAME        MAJ:MIN RM   SIZE RO TYPE  MOUNTPOINTS
nvme0n1     259:1    0   5.8T  0 disk  
├─nvme0n1p1 259:2    0   512M  0 part  /boot/efi
├─nvme0n1p2 259:3    0   5.8T  0 part  /var
…
│                                      /
└─nvme0n1p3 259:4    0     1G  0 part  [SWAP]
nvme2n1     259:5    0 476.9G  0 disk  
└─md127       9:127  0 476.8G  0 raid0 /var/lib/openqa

https://gitlab.suse.de/openqa/salt-pillars-openqa/-/merge_requests/847 reduces the number of worker instances from 49 to 46.
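
For a rough sense of scale, the arithmetic can be sketched as below. It assumes, purely for illustration, that the 470G filesystem is shared evenly between all instances and ignores the space taken by the cache directory; the helper name `per_instance_mib` is hypothetical. The last line adds up the two configured disk sizes of the job in pool/19.

```shell
# Sketch: average space per worker instance if the pool partition
# were shared evenly (an illustrative assumption, not how jobs behave).
per_instance_mib() {
    local size_gib="$1" instances="$2"
    echo $(( size_gib * 1024 / instances ))
}

per_instance_mib 470 49   # before the merge request: 9822
per_instance_mib 470 46   # after the merge request: 10462

# Combined configured disk size in GB of the job in pool/19
# (HDDSIZEGB + HDDSIZEGB_2): 191
echo $(( 60 + 131 ))
```

Under these assumptions the reduction adds well under 1 GiB per instance on average, while pool/19 alone was already using 46G.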

Actions #6

Updated by okurz 26 days ago

Actions #7

Updated by okurz 26 days ago

  • Due date set to 2024-07-05
  • Status changed from In Progress to Feedback

https://gitlab.suse.de/openqa/salt-pillars-openqa/-/merge_requests/847 merged and applied. Let's see how much that helps.

Actions #8

Updated by okurz 26 days ago

  • Due date deleted (2024-07-05)
  • Status changed from Feedback to Resolved
Actions #9

Updated by okurz 20 days ago

  • Subject changed from [FIRING:1] worker40 (worker40: partitions usage (%) alert openQA partitions_usage_alert_worker40 worker) to [FIRING:1] worker40 (worker40: partitions usage (%) alert openQA partitions_usage_alert_worker40 worker) auto_review:"No space left on device":retry
  • Status changed from Resolved to In Progress

This is still happening repeatedly, causing incomplete jobs like https://openqa.suse.de/tests/14737966

Actions #10

Updated by okurz 20 days ago

https://gitlab.suse.de/openqa/salt-pillars-openqa/-/merge_requests/851

and triggered openqa-label-known-issues and openqa-advanced-retrigger

export host=openqa.suse.de; failed_since="'2024-06-27'" result="'incomplete'" ./openqa-monitor-investigation-candidates | ./openqa-label-known-issues-multi 

and

host=openqa.suse.de failed_since="2024-06-27 07:00" result="result='incomplete'" additional_filters="reason like '%terminated prematurely%'" comment="label:poo#162596" ./openqa-advanced-retrigger-jobs
Actions #11

Updated by okurz 20 days ago

  • Description updated (diff)
Actions #12

Updated by okurz 20 days ago

  • Status changed from In Progress to Feedback
Actions #14

Updated by okurz 14 days ago

  • Status changed from Feedback to Blocked
Actions #15

Updated by livdywan 8 days ago

okurz wrote in #note-14:

https://monitor.qa.suse.de/d/WDworker40/worker-dashboard-worker40?orgId=1&from=1719900063344&to=1719949375574&viewPanel=65090 shows that we still have too high partition usage. Need to block on #162719

Work on the blocker starting now

Actions #16

Updated by livdywan 2 days ago

livdywan wrote in #note-15:

okurz wrote in #note-14:

https://monitor.qa.suse.de/d/WDworker40/worker-dashboard-worker40?orgId=1&from=1719900063344&to=1719949375574&viewPanel=65090 shows that we still have too high partition usage. Need to block on #162719

Work on the blocker starting now

Not just yet. But I bumped the priority to match.
