action #150887
Updated by okurz 11 months ago
## Observation

From email alert "s390zl12: partitions usage (%) alert" (Firing):

* Values: A0=88.11778063708574
* Labels: alertname=`s390zl12: partitions usage (%) alert`, grafana_folder=`Generic`, hostname=`s390zl12`, rule_uid=`partitions_usage_alert_s390zl12`, type=`generic`
* Observed 32s before this notification was delivered, at 2023-11-15 03:48:00 +0100 CET
* panel link: http://stats.openqa-monitor.qa.suse.de/d/GDs390zl12?orgId=1&viewPanel=65090

From s390zl12:

```
/dev/mapper/3600507638081855cd80000000000004b-part1 on /var/lib/libvirt/images type ext4 (rw,relatime,nobarrier,stripe=8,data=writeback)
```

So, likely as expected, this was about /var/lib/libvirt/images, and monitoring shows this alert triggering again from time to time. okurz does not think it is wise to just brute-force delete by having the cron job call https://gitlab.suse.de/openqa/salt-states-openqa/-/blob/master/libvirt/cleanup-openqa-assets?ref_type=heads more often; instead, a better solution should be found to prevent the storage from overflowing.

## Suggestions

* *DONE* We see that at least one partition was 88% full, which is apparently above our threshold
* *DONE* Check the actual threshold
* *DONE* Ensure that our NFS share from OSD is not the one we alert about
* *DONE* There is a cleanup script triggered by cron or a systemd timer (TBC) which might run less often than the interval at which we check partition usage, so that might be racy
* Reduce the cron run interval anyway and unsilence to make it more likely that alerts are prevented (see the sketch at the end of this comment)

## Rollback actions

* Remove silence for `rule_uid=~partitions_usage_alert_s390zl.*` from https://stats.openqa-monitor.qa.suse.de/alerting/silences
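For the remaining suggestion above, a minimal sketch of how a reduced interval could look as a Salt `cron.present` state, assuming the cleanup script from the linked repository is deployed locally. The state ID, installed script path, and the 30-minute interval are illustrative assumptions, not the actual salt-states-openqa contents:

```yaml
# Hypothetical SLS sketch: run the libvirt asset cleanup every 30 minutes.
# The installed path of cleanup-openqa-assets and the interval are assumed
# for illustration only.
cleanup-openqa-assets:
  cron.present:
    - name: /usr/local/bin/cleanup-openqa-assets
    - user: root
    - minute: '*/30'
    - identifier: cleanup-openqa-assets
```

The `identifier` keeps the entry idempotent across salt runs, so tightening `minute` later replaces the existing crontab line instead of adding a duplicate.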