action #150887 (closed)
[alert] [FIRING:1] s390zl12 (s390zl12: partitions usage (%) alert Generic partitions_usage_alert_s390zl12 generic), also s390zl13 size:M
Description
Observation
From the alert email:

Firing: s390zl12: partitions usage (%) alert
Values: A0=88.11778063708574
Labels:
- alertname: s390zl12: partitions usage (%) alert
- grafana_folder: Generic
- hostname: s390zl12
- rule_uid: partitions_usage_alert_s390zl12
- type: generic

Observed 32s before this notification was delivered, at 2023-11-15 03:48:00 +0100 CET
panel link: http://stats.openqa-monitor.qa.suse.de/d/GDs390zl12?orgId=1&viewPanel=65090

From s390zl12:
/dev/mapper/3600507638081855cd80000000000004b-part1 on /var/lib/libvirt/images type ext4 (rw,relatime,nobarrier,stripe=8,data=writeback)
So, presumably as expected, this was about /var/lib/libvirt/images, and monitoring shows this alert triggering again from time to time. okurz does not think it is wise to just brute-force delete assets by calling the cron job https://gitlab.suse.de/openqa/salt-states-openqa/-/blob/master/libvirt/cleanup-openqa-assets?ref_type=heads more often; instead, a better solution should be found to prevent the storage from overflowing.
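As a first check on the host, one could inspect the partition directly; a minimal sketch using standard coreutils (the directory is taken from the mount output above):

```bash
# Show current usage of the partition that triggered the alert
df -h /var/lib/libvirt/images
# List the largest entries to see what actually consumes space;
# -x stays on this one filesystem
du -xh --max-depth=1 /var/lib/libvirt/images | sort -rh | head
```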
Suggestions
- DONE So we see that at least one partition was 88% full, which is apparently above our threshold
- DONE Check the actual threshold
- DONE Ensure that our NFS share from OSD is not the one we alert about
- DONE There is a cleanup script triggered by cron or a systemd timer (TBC) which might run less often than the partition-usage check, so the two might be racy
- Reduce the cron run interval anyway and unsilence the alert to make it more likely that alerts are prevented (see the sketch after this list)
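To verify whether the cleanup is driven by cron or a systemd timer, and how often it runs, something along these lines should work; the grep patterns and the installed script path are assumptions about how the job looks on the host, not confirmed details:

```bash
# Check for a systemd timer driving the cleanup and its schedule
systemctl list-timers --all | grep -i cleanup
# Otherwise look for a matching cron entry
grep -r cleanup-openqa-assets /etc/cron* /var/spool/cron 2>/dev/null

# Hypothetical tightened crontab entry: run the cleanup every 30 minutes
# (the deployed path of cleanup-openqa-assets is an assumption)
# */30 * * * * root /usr/local/bin/cleanup-openqa-assets
```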
Rollback actions
- Remove the silence for rule_uid=~partitions_usage_alert_s390zl.* from https://stats.openqa-monitor.qa.suse.de/alerting/silences
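If doing this via the API rather than the web UI, a hedged sketch using Grafana's Alertmanager-compatible endpoints; the GRAFANA_TOKEN service-account token and the <silence-id> placeholder are assumptions, with the ID taken from the listing call:

```bash
# List active silences on the Grafana-managed Alertmanager
curl -s -H "Authorization: Bearer $GRAFANA_TOKEN" \
  https://stats.openqa-monitor.qa.suse.de/api/alertmanager/grafana/api/v2/silences
# Expire the silence matching rule_uid=~partitions_usage_alert_s390zl.* by its ID
curl -s -X DELETE -H "Authorization: Bearer $GRAFANA_TOKEN" \
  "https://stats.openqa-monitor.qa.suse.de/api/alertmanager/grafana/api/v2/silence/<silence-id>"
```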