action #151807
closed
[alert] o3 zabbix: Problem: /var/lib/snapshot-changes: Disk space is critically low (used > 94%) size:M
Added by tinita about 1 year ago.
Updated about 1 year ago.
Description
Observation
https://progress.opensuse.org/issues/151807
Date: Thu, 30 Nov 2023 19:51:09 +0000
From: zabbix@suse.de
To: o3-admins@suse.de
Subject: Problem: /space: Disk space is critically low (used > 94%)
Problem started at 20:36:00 on 2023.11.30
Problem name: /space: Disk space is critically low (used > 94%)
Host: ariel.dmz-prg2.suse.org
Severity: Average
Operational data: Space used: 4.82 TB of 5 TB (96.4 %)
Original problem ID: 629867054
As you can see in the screenshot, usage started to increase around the last week of October.
Rollback
- Enable hook scripts:
  - edit /opt/os-autoinst-scripts/openqa-label-known-issues-and-investigate-hook and remove the exit
  - uncomment the hooks in /etc/openqa/openqa.ini
Suggestions
- Fix the problem and adjust the alert thresholds
- Status changed from New to In Progress
- Assignee set to tinita
We found out that we have a large number of jobs without a group, mostly investigate jobs.
We have now added limits for jobs without a group (as we already had on osd), and currently the limit_results_and_logs cleanup seems to be able to slowly delete stuff.
[no_group_limits]
log_storage_duration = 10
important_log_storage_duration = 25
result_storage_duration = 15
important_result_storage_duration = 50
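To make the effect of these limits concrete, here is a minimal sketch that parses the section above and turns each value into a cutoff date. It assumes the durations are given in days; the function name `retention_cutoffs` and the reference date are made up for illustration and are not part of openQA.

```python
import configparser
from datetime import datetime, timedelta

# The section exactly as quoted in this comment.
INI_TEXT = """
[no_group_limits]
log_storage_duration = 10
important_log_storage_duration = 25
result_storage_duration = 15
important_result_storage_duration = 50
"""


def retention_cutoffs(ini_text, now):
    """Map each setting to the timestamp before which data counts as expired.

    Assumption: every value is a number of days.
    """
    parser = configparser.ConfigParser()
    parser.read_string(ini_text)
    limits = parser["no_group_limits"]
    return {key: now - timedelta(days=int(value)) for key, value in limits.items()}


if __name__ == "__main__":
    # Hypothetical reference date, one day after the alert.
    for key, cutoff in retention_cutoffs(INI_TEXT, datetime(2023, 12, 1)).items():
        print(f"{key}: expired if older than {cutoff:%Y-%m-%d}")
```

Under that assumption, logs of ungrouped jobs expire first, and results of important jobs are kept the longest.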
I also disabled the hook scripts to stop creating new investigate jobs for now.
We're monitoring the situation.
- Description updated (diff)
Additional question: Zabbix apparently already thought yesterday evening that the problem had been resolved, although the graph clearly shows that the used space never went below 94% since the alarm: https://zabbix.suse.de/history.php?action=showgraph&itemids%5B%5D=342059
Hm, and having a closer look at that graph, it was already above 94% way before: https://zabbix.suse.de/history.php?action=showgraph&itemids%5B%5D=342059
See screenshot.
Could it be that the alert has the 94 in the description, but the threshold is actually 96? That would explain the short timeframe, and that we actually got a new alert this morning when it went over 96 again?
I am again lost in the Zabbix configuration; I can't figure out the threshold.
tinita wrote in #note-3:
Additional question: Zabbix apparently already thought yesterday evening that the problem had been resolved, although the graph clearly shows that the used space never went below 94% since the alarm: https://zabbix.suse.de/history.php?action=showgraph&itemids%5B%5D=342059
Hm, and having a closer look at that graph, it was already above 94% way before: https://zabbix.suse.de/history.php?action=showgraph&itemids%5B%5D=342059
See screenshot.
Could it be that the alert has the 94 in the description, but the threshold is actually 96? That would explain the short timeframe, and that we actually got a new alert this morning when it went over 96 again?
I am again lost in the Zabbix configuration; I can't figure out the threshold.
You can see the configuration of the trigger at https://zabbix.suse.de/triggers.php?form=update&triggerid=115396&context=host; the trigger is apparently a little more complicated than just a simple percentage:
Two conditions should match: First, space utilization should be above {$VFS.FS.PUSED.MAX.CRIT:"/var/lib/snapshot-changes"}.
Second condition should be one of the following:
- The disk free space is less than 5G.
- The disk will be full in less than 24 hours.
The second part of the condition, timeleft(/ariel.dmz-prg2.suse.org/vfs.fs.size[/var/lib/snapshot-changes,pused],1h,100)<1d, is what creates the recovery for the alarm if the space utilization growth is "not fast enough" to fill the partition in one day.
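The combined condition can be sketched as a small predicate. This is only an illustration of the logic described above, not the actual Zabbix expression: the function name, the parameter names, and the default of 94% for the critical threshold (taken from the alert title) are assumptions, and "5G" is interpreted here as 5 GiB.

```python
GIB = 1024 ** 3
DAY = 24 * 3600


def trigger_fires(pused_percent, free_bytes, seconds_until_full,
                  crit_percent=94.0, min_free_bytes=5 * GIB,
                  horizon_seconds=DAY):
    """Return True if the alert would be in problem state.

    First condition: utilization is above the critical threshold.
    Second condition: free space is below a fixed size OR the disk is
    predicted to be full within the horizon.
    """
    over_threshold = pused_percent > crit_percent
    nearly_out_of_space = free_bytes < min_free_bytes
    fills_soon = seconds_until_full < horizon_seconds
    return over_threshold and (nearly_out_of_space or fills_soon)


if __name__ == "__main__":
    free = 180 * 10**9  # roughly 5 TB minus 4.82 TB
    # Fast growth: predicted full in 12 hours, so the alert fires.
    print(trigger_fires(96.4, free, 12 * 3600))
    # Slow growth: predicted full in 3 days, so the alert recovers
    # even though the disk is still 96.4% full.
    print(trigger_fires(96.4, free, 3 * DAY))
```

With these inputs the second call returns False, which matches the observed behavior: the problem "resolves" while usage stays above the threshold.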
I would vote for removing that "fill the partition in one day" condition. If we are only alerted when there's 1 day left, that means we could run full on a weekend.
tinita wrote in #note-5:
I would vote for removing that "fill the partition in one day" condition. If we are only alerted when there's 1 day left, that means we could run full on a weekend.
Or we can tweak the interval, like have a prediction for a week maybe? But anyway, the trigger is from the trigger prototype (it is automatically added for all discovered partitions), the prototype is added from a generic template (which we cannot edit); so the process is:
- clone the prototype: https://zabbix.suse.de/trigger_prototypes.php?form=update&parent_discoveryid=341978&triggerid=115361&context=host
- edit the new clone as desired (the template expression for the triggers which will be created by the discovery)
- disable the old prototype (prevent the discovery) and disable the old triggers; new triggers from the clone will eventually appear
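To get a feeling for what a one-day versus a one-week prediction window would mean when editing the clone, here is a rough sketch of a time-to-full estimate. It assumes a plain least-squares linear fit over the samples of the last hour; the function name and the sample values are invented for illustration, and the real timeleft function may behave differently in detail.

```python
def seconds_until_full(samples, full_percent=100.0):
    """Estimate seconds from the newest sample until usage hits full_percent.

    samples: list of (timestamp_seconds, used_percent) tuples.
    Returns float("inf") if usage is flat, shrinking, or cannot be fitted.
    """
    n = len(samples)
    if n < 2:
        return float("inf")
    mean_t = sum(t for t, _ in samples) / n
    mean_u = sum(u for _, u in samples) / n
    var_t = sum((t - mean_t) ** 2 for t, _ in samples)
    if var_t == 0:
        return float("inf")
    # Slope and intercept of the least-squares line u = slope * t + intercept.
    slope = sum((t - mean_t) * (u - mean_u) for t, u in samples) / var_t
    if slope <= 0:
        return float("inf")
    intercept = mean_u - slope * mean_t
    t_full = (full_percent - intercept) / slope
    newest = max(t for t, _ in samples)
    return max(0.0, t_full - newest)


if __name__ == "__main__":
    DAY = 24 * 3600
    # Hypothetical samples: usage grows by 0.05 percentage points per hour.
    samples = [(0, 96.0), (1800, 96.025), (3600, 96.05)]
    left = seconds_until_full(samples)
    print(f"predicted full in {left / DAY:.1f} days")
    print("alerts with 1-day window:", left < 1 * DAY)
    print("alerts with 7-day window:", left < 7 * DAY)
```

In this made-up example the disk would be full in a bit over three days, so a one-day window stays silent while a one-week window would alert.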
- Due date set to 2023-12-16
Setting due date based on mean cycle time of SUSE QE Tools
- Status changed from In Progress to Resolved
- Status changed from Resolved to Feedback
The rollback steps weren't conducted, which is why we did not have any investigate jobs or any other thing done by hook scripts.
I now enabled the hook scripts in openqa.ini, removed the temporary exit in openqa-label-known-issues-and-investigate-hook and restarted gru.
Also there was this open discussion about not being notified early enough, so I would rather like this being kept open and discuss it.
Also I set back max_running_jobs to 170. We lowered it to make sure the load is not too high so the cleanup job can finally finish.
Ok, I tried to clone the alert, but Zabbix is having problems right now; I will try again later.
- Subject changed from [alert] o3 zabbix: Problem: /var/lib/snapshot-changes: Disk space is critically low (used > 94%) to [alert] o3 zabbix: Problem: /var/lib/snapshot-changes: Disk space is critically low (used > 94%) size:M
- Description updated (diff)
- Status changed from Feedback to In Progress
Still trying to figure out the Zabbix configuration. Apparently there is also a "Warning" trigger, but the only difference is that it would warn if there are only 10GB left, and the "Average" trigger (which is apparently higher than Warning) will warn if there are only 5GB left.
On a disk with several TB these hardcoded GB values of course don't make much sense.
Apart from that the condition about when the disk will be full is equal for both.
I increased the values for WARN and CRIT to 85% / 90% now and will edit the timeleft conditions.
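A quick calculation shows why the fixed GB values are not useful here and what the new percentages mean in absolute terms. It assumes the 5 TB total from the alert mail and decimal units; the helper names are made up for this example.

```python
TB = 10**12
GB = 10**9


def percent_of_disk(nbytes, disk_bytes):
    """Size of nbytes expressed as a percentage of the whole disk."""
    return 100.0 * nbytes / disk_bytes


def free_bytes_at(used_percent, disk_bytes):
    """Free space remaining when the disk is used_percent full."""
    return disk_bytes * (100.0 - used_percent) / 100.0


if __name__ == "__main__":
    disk = 5 * TB
    for label, size in (("Average, 5 GB left", 5 * GB),
                        ("Warning, 10 GB left", 10 * GB)):
        print(f"{label}: {percent_of_disk(size, disk):.1f}% of the disk")
    for label, pused in (("WARN", 85.0), ("CRIT", 90.0)):
        print(f"{label} at {pused:.0f}%: {free_bytes_at(pused, disk) / GB:.0f} GB free")
```

On a 5 TB disk, 5 GB and 10 GB are 0.1% and 0.2% of the capacity, while the 85% and 90% thresholds correspond to 750 GB and 500 GB of free space.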
jbaier_cz wrote in #note-6:
- clone the prototype: https://zabbix.suse.de/trigger_prototypes.php?form=update&parent_discoveryid=341978&triggerid=115361&context=host
- edit the new clone as desired (the template expression for the triggers which will be created by the discovery)
- disable the old prototype (prevent the discovery) and disable the old triggers; new triggers from the clone will eventually appear
I'm not sure if I did the right thing. There is no checkbox for disabling or enabling, just "Create enabled".
I don't know what "Discover" means, but I unchecked it for the old triggers.
Now I guess I have to wait until the new triggers will "eventually appear", but I don't know where they will appear and how long it could take. I mean, they already appear in the prototype list, because I created them.
I attached a screenshot of how it currently looks.
- Status changed from In Progress to Feedback
tinita wrote in #note-14:
jbaier_cz wrote in #note-6:
- clone the prototype: https://zabbix.suse.de/trigger_prototypes.php?form=update&parent_discoveryid=341978&triggerid=115361&context=host
- edit the new clone as desired (the template expression for the triggers which will be created by the discovery)
- disable the old prototype (prevent the discovery) and disable the old triggers; new triggers from the clone will eventually appear
I'm not sure if I did the right thing. There is no checkbox for disabling or enabling, just "Create enabled".
I don't know what "Discover" means, but I unchecked it for the old triggers.
Yes, disabling discovery is the right choice (discovery is the process of automatic item creation).
Now I guess I have to wait until the new triggers will "eventually appear", but I don't know where they will appear and how long it could take. I mean, they already appear in the prototype list, because I created them.
The main discovery rule (https://zabbix.suse.de/host_discovery.php?form=update&itemid=341978&context=host) has a 1h update interval, so the new items/triggers on the host should appear within an hour. The old ones will disappear within 30 days.
Maybe we can look at it together on Thursday? I don't know where to look for its appearance...
- Status changed from Feedback to Resolved
- Related to action #162521: Reconsider the global job limit on o3, try higher than 170 added