action #150887

Updated by okurz 4 months ago

## Observation 
 From email 
  Firing 
 s390zl12: partitions usage (%) alert 
 View alert 
 Values 
 A0=88.11778063708574  
 Labels 
 alertname       s390zl12: partitions usage (%) alert 
 grafana_folder       Generic 
 hostname       s390zl12 
 rule_uid       partitions_usage_alert_s390zl12 
 type       generic 
 Silence 
 View dashboard 
 View panel 
 Observed 32s before this notification was delivered, at 2023-11-15 03:48:00 +0100 CET 

 panel link http://stats.openqa-monitor.qa.suse.de/d/GDs390zl12?orgId=1&viewPanel=65090 

 From s390zl12 

 ``` 
 /dev/mapper/3600507638081855cd80000000000004b-part1 on /var/lib/libvirt/images type ext4 (rw,relatime,nobarrier,stripe=8,data=writeback) 
 ``` 

 So, as expected, the alert is most likely about /var/lib/libvirt/images, and monitoring shows it firing again from time to time. okurz does not think it is wise to just brute-force delete more often by having the cron job call https://gitlab.suse.de/openqa/salt-states-openqa/-/blob/master/libvirt/cleanup-openqa-assets?ref_type=heads more frequently; instead, a better solution should be found to prevent the storage from overflowing. 

 ## Suggestions 
 * *DONE* At least one partition was 88% full, which is apparently above our threshold 
 * *DONE* Check the actual threshold 
 * *DONE* Ensure that our NFS share from OSD is not the one we alert about 
 * *DONE* There is a cleanup script triggered by cron or a systemd timer (TBC) which might run less often than we check the partition usage, so the alert may be racy 
 * Reduce the cron run interval anyway to make alerts less likely, then remove the silence 
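
 The threshold comparison from the suggestions above can be sketched in Python as follows. This is only an illustration: the 85% threshold is a hypothetical value (the ticket does not state the real one), the helper names are made up for this example, and the percentage computed here may differ slightly from what the monitoring reports, depending on how reserved blocks are counted.

 ```python
import shutil

# Hypothetical threshold in percent; the real alert threshold is not stated in this ticket.
ASSUMED_THRESHOLD = 85.0


def usage_percent(path: str) -> float:
    """Return the used share of the filesystem containing `path`, in percent."""
    usage = shutil.disk_usage(path)
    return 100.0 * usage.used / usage.total


def exceeds_threshold(percent: float, threshold: float = ASSUMED_THRESHOLD) -> bool:
    """Return True if the usage percentage is above the threshold."""
    return percent > threshold


if __name__ == "__main__":
    # Path taken from the mount output in the observation above.
    path = "/var/lib/libvirt/images"
    try:
        percent = usage_percent(path)
    except FileNotFoundError:
        print(f"{path} does not exist on this host")
    else:
        state = "above" if exceeds_threshold(percent) else "within"
        print(f"{path}: {percent:.1f}% used, {state} the assumed threshold")
 ```

 Running such a check right before and right after the cleanup job would show whether the alert only fires in the window between two cleanup runs.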


 

 ## Rollback actions 
 * Remove silence for `rule_uid=~partitions_usage_alert_s390zl.*` from https://stats.openqa-monitor.qa.suse.de/alerting/silences
