action #150887

closed

[alert] [FIRING:1] s390zl12 (s390zl12: partitions usage (%) alert Generic partitions_usage_alert_s390zl12 generic), also s390zl13 size:M

Added by okurz about 1 year ago. Updated 11 months ago.

Status: Resolved
Priority: Normal
Assignee:
Category: -
Start date: 2023-11-15
Due date:
% Done: 0%
Estimated time:

Description

Observation

From email:

Firing: s390zl12: partitions usage (%) alert
Values: A0=88.11778063708574
Labels:
  alertname: s390zl12: partitions usage (%) alert
  grafana_folder: Generic
  hostname: s390zl12
  rule_uid: partitions_usage_alert_s390zl12
  type: generic
Observed 32s before this notification was delivered, at 2023-11-15 03:48:00 +0100 CET

panel link http://stats.openqa-monitor.qa.suse.de/d/GDs390zl12?orgId=1&viewPanel=65090

From s390zl12

/dev/mapper/3600507638081855cd80000000000004b-part1 on /var/lib/libvirt/images type ext4 (rw,relatime,nobarrier,stripe=8,data=writeback)

So, perhaps as expected, this is about /var/lib/libvirt/images, and monitoring shows the alert triggering again from time to time. okurz does not think it wise to simply brute-force delete data by calling https://gitlab.suse.de/openqa/salt-states-openqa/-/blob/master/libvirt/cleanup-openqa-assets?ref_type=heads from the cron job more often; instead, a better solution should be found to prevent the storage from overflowing.
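For reference, the figure the "partitions usage (%)" alert is based on can be reproduced locally with df. A minimal shell sketch; the 85% threshold is an assumption for illustration only (the ticket shows that 88.1% fired, but the configured Grafana threshold is not quoted here):

```shell
#!/bin/sh
# Print the usage percentage of the filesystem holding a path, the same
# figure the "partitions usage (%)" alert is based on.
usage_pct() {
    df --output=pcent "$1" | tail -n 1 | tr -dc '0-9'
}

# Compare against a threshold; 85 is a placeholder, the real alert
# threshold is configured in Grafana and not quoted in this ticket.
check_partition() {
    pct=$(usage_pct "$1")
    if [ "$pct" -gt "$2" ]; then
        echo "FIRING: $1 at ${pct}% (threshold $2%)"
    else
        echo "OK: $1 at ${pct}% (threshold $2%)"
    fi
}

check_partition / 85  # on s390zl12 this would be /var/lib/libvirt/images
```

Running this against /var/lib/libvirt/images on s390zl12 during a spike should reproduce the value the alert reported.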

Suggestions

  • DONE So we see that at least one partition was 88% full, which is apparently above our threshold
  • DONE Check the actual threshold
  • DONE Ensure that our NFS share from OSD is not the one we alert about
  • DONE There is a cleanup script triggered by cron or a systemd timer (TBC) which might run less often than we check the partition usage, so that might be racy
  • Reduce the cron run interval anyway and unsilence the alert to make it more likely that alerts are prevented
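If the run interval is reduced, a systemd timer is easier to tune than a cron entry. A hypothetical unit; the unit names and the 30-minute interval are assumptions for illustration, not taken from salt-states-openqa:

```ini
# /etc/systemd/system/cleanup-openqa-assets.timer (hypothetical)
[Unit]
Description=Periodic cleanup of openQA libvirt assets

[Timer]
# Run every 30 minutes; both the schedule and the unit name are
# placeholders, the real configuration lives in salt-states-openqa.
OnCalendar=*:0/30
Persistent=true

[Install]
WantedBy=timers.target
```

The matching .service unit would invoke the cleanup-openqa-assets script once per activation.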

Rollback actions


Related issues (3: 1 open, 2 closed)

  • Related to openQA Infrastructure (public) - action #138650: partition usage panels show a long list of undefined and no reasonable graphs at least for some generic machines size:M (Resolved, tinita, 2023-10-27)
  • Related to openQA Infrastructure (public) - action #164853: [alert][FIRING:1] s390zl13 (s390zl13: Memory usage alert Generic memory_usage_alert_s390zl13 generic) size:S (Resolved, okurz, 2024-08-02)
  • Copied to openQA Infrastructure (public) - action #154180: Proper kvm asset cleanup for s390x kvm backend (svirt) and tests (Workable)
Actions #1

Updated by okurz about 1 year ago

  • Related to action #138650: partition usage panels show a long list of undefined and no reasonable graphs at least for some generic machines size:M added
Actions #2

Updated by okurz about 1 year ago

  • Subject changed from [alert] [FIRING:1] s390zl12 (s390zl12: partitions usage (%) alert Generic partitions_usage_alert_s390zl12 generic) to [alert] [FIRING:1] s390zl12 (s390zl12: partitions usage (%) alert Generic partitions_usage_alert_s390zl12 generic), also s390zl13
  • Description updated (diff)
Actions #3

Updated by okurz about 1 year ago

  • Target version changed from Tools - Next to Ready
Actions #4

Updated by okurz about 1 year ago

  • Description updated (diff)
  • Status changed from New to Blocked
  • Assignee set to okurz
  • Target version changed from Ready to Tools - Next

We need to fix #138650 first to know which partition this was/is about.

Actions #5

Updated by tinita about 1 year ago

#138650 in progress

Actions #6

Updated by tinita about 1 year ago

  • Status changed from Blocked to New

#138650 resolved

Actions #7

Updated by okurz 11 months ago

  • Assignee deleted (okurz)

https://stats.openqa-monitor.qa.suse.de/d/GDs390zl12/dashboard-for-s390zl12?orgId=1&viewPanel=65090&from=1699934368877&to=1700076581131 now shows the problem clearly: "device dm-1", fstype ext4. On s390zl12 I can then find

/dev/mapper/3600507638081855cd80000000000004b-part1 on /var/lib/libvirt/images type ext4 (rw,relatime,nobarrier,stripe=8,data=writeback)

So, perhaps as expected, this is about /var/lib/libvirt/images, and monitoring shows the alert triggering again from time to time. I don't think it wise to simply brute-force delete data by calling https://gitlab.suse.de/openqa/salt-states-openqa/-/blob/master/libvirt/cleanup-openqa-assets?ref_type=heads from the cron job more often; instead, a better solution should be found to prevent the storage from overflowing.

Actions #8

Updated by okurz 11 months ago

  • Description updated (diff)
  • Status changed from New to In Progress
  • Assignee set to okurz
  • Target version changed from Tools - Next to Ready
Actions #9

Updated by okurz 11 months ago

  • Subject changed from [alert] [FIRING:1] s390zl12 (s390zl12: partitions usage (%) alert Generic partitions_usage_alert_s390zl12 generic), also s390zl13 to [alert] [FIRING:1] s390zl12 (s390zl12: partitions usage (%) alert Generic partitions_usage_alert_s390zl12 generic), also s390zl13 size:M
Actions #10

Updated by okurz 11 months ago

  • Copied to action #154180: Proper kvm asset cleanup for s390x kvm backend (svirt) and tests added
Actions #11

Updated by okurz 11 months ago

  • Due date set to 2024-01-31
  • Status changed from In Progress to Feedback
Actions #12

Updated by okurz 11 months ago

  • Due date deleted (2024-01-31)
  • Status changed from Feedback to Resolved

https://stats.openqa-monitor.qa.suse.de/d/GDs390zl12/dashboard-for-s390zl12?orgId=1&viewPanel=65090&from=1705989220034&to=1706212393466 shows peaks going up to 75%, so the alert might still appear, but we will see. I removed the alert silence.

Actions #13

Updated by okurz 4 months ago

  • Related to action #164853: [alert][FIRING:1] s390zl13 (s390zl13: Memory usage alert Generic memory_usage_alert_s390zl13 generic) size:S added