Project

General

Profile

Actions

action #151807

closed

[alert] o3 zabbix: Problem: /var/lib/snapshot-changes: Disk space is critically low (used > 94%) size:M

Added by tinita 10 months ago. Updated 10 months ago.

Status:
Resolved
Priority:
High
Assignee:
Category:
-
Target version:
Start date:
2023-11-30
Due date:
2023-12-16
% Done:

0%

Estimated time:

Description

Observation

https://progress.opensuse.org/issues/151807

Date: Thu, 30 Nov 2023 19:51:09 +0000                                                                                                                                                                         
From: zabbix@suse.de                                                                                                                                                                                          
To: o3-admins@suse.de                                                                                                                                                                                         
Subject: Problem: /space: Disk space is critically low (used > 94%)                                                                                                                                           

Problem started at 20:36:00 on 2023.11.30
Problem name: /space: Disk space is critically low (used > 94%)
Host: ariel.dmz-prg2.suse.org
Severity: Average
Operational data: Space used: 4.82 TB of 5 TB (96.4 %)
Original problem ID: 629867054

As you can see in the screenshot, it has been starting to increase around the last week of october.

Rollback

  • Enable hook scripts:
    • edit /opt/os-autoinst-scripts/openqa-label-known-issues-and-investigate-hook and remove the exit
    • comment in hooks in /etc/openqa/openqa.ini

Suggestions

  • Fix the problem and adjust the alert thresholds

Files

o3-disk-space.png (98.9 KB) o3-disk-space.png tinita, 2023-11-30 21:43
zabbix-disk-space-o3.png (64.7 KB) zabbix-disk-space-o3.png tinita, 2023-12-01 11:53
zabbix-trigger-prototypes.png (228 KB) zabbix-trigger-prototypes.png tinita, 2023-12-12 10:03

Related issues 1 (1 open0 closed)

Related to openQA Infrastructure - action #162521: Reconsider the global job limit on o3, try higher than 170New2024-06-19

Actions
Actions #1

Updated by tinita 10 months ago · Edited

  • Status changed from New to In Progress
  • Assignee set to tinita

We found out that we have a large number of jobs without a group. Mostly investigate jobs.
We now added limits to jobs without a group (like we had already on osd), and currently it seems the limit_results_and_logs cleanup is able to slowly delete stuff.

[no_group_limits]
log_storage_duration = 10
important_log_storage_duration = 25
result_storage_duration = 15
important_result_storage_duration = 50

I also disabled the hook scripts to stop creating new investigate jobs for now.
We're monitoring the situation.

Actions #2

Updated by tinita 10 months ago

  • Description updated (diff)
Actions #3

Updated by tinita 10 months ago

Additional question: Zabbix apparently thought already yesterday evenung that the problem had been resolved, although the graph clearly shows that the used space never went below 94% since the alarm: https://zabbix.suse.de/history.php?action=showgraph&itemids%5B%5D=342059
Hm, and having a closer look at that graph, It was already above 94% way before: https://zabbix.suse.de/history.php?action=showgraph&itemids%5B%5D=342059
See sreenshot.
Could it be that the alert has the 94 in the description, but the treshold is actually 96? That would explain the short timeframe, and that we actually got a new alert this morning when it went over 96 again?
I am again lost in zabbix configuration, I can't figure out the threshold.

Actions #4

Updated by jbaier_cz 10 months ago

tinita wrote in #note-3:

Additional question: Zabbix apparently thought already yesterday evenung that the problem had been resolved, although the graph clearly shows that the used space never went below 94% since the alarm: https://zabbix.suse.de/history.php?action=showgraph&itemids%5B%5D=342059
Hm, and having a closer look at that graph, It was already above 94% way before: https://zabbix.suse.de/history.php?action=showgraph&itemids%5B%5D=342059
See sreenshot.
Could it be that the alert has the 94 in the description, but the treshold is actually 96? That would explain the short timeframe, and that we actually got a new alert this morning when it went over 96 again?
I am again lost in zabbix configuration, I can't figure out the threshold.

You can see the configuration of the trigger at https://zabbix.suse.de/triggers.php?form=update&triggerid=115396&context=host; the trigger is apparently a little more complicated that just a simple percentage:

Two conditions should match: First, space utilization should be above {$VFS.FS.PUSED.MAX.CRIT:"/var/lib/snapshot-changes"}.
Second condition should be one of the following:

  • The disk free space is less than 5G.
  • The disk will be full in less than 24 hours.

The second part of the condition timeleft(/ariel.dmz-prg2.suse.org/vfs.fs.size[/var/lib/snapshot-changes,pused],1h,100)<1d is what creates the recovery for the alarm if the space utilization growth is "not fast enough" to fill the partition in one day.

Actions #5

Updated by tinita 10 months ago · Edited

I would vote for removing that "fill the partition in one day" condition. If we are only alerted when there's 1 day left, that means we could run full on a weekend.

Actions #6

Updated by jbaier_cz 10 months ago

tinita wrote in #note-5:

I would vote for removing that "fill the partition in one day" condition. If we are only alerted when there's 1 day left, that means we could run full on a weekend.

Or we can tweak the interval, like have a prediction for a week maybe? But anyway, the trigger is from the trigger prototype (it is automatically added for all discovered partitions), the prototype is added from a generic template (which we cannot edit); so the process is:

  1. clone the prototype: https://zabbix.suse.de/trigger_prototypes.php?form=update&parent_discoveryid=341978&triggerid=115361&context=host
  2. edit the new clone as desired (the template expression for the triggers which will be created by the discovery)
  3. disable the old prototype (prevent the discovery) and disable the old triggers; new triggers from the clone will eventually appear
Actions #7

Updated by openqa_review 10 months ago

  • Due date set to 2023-12-16

Setting due date based on mean cycle time of SUSE QE Tools

Actions #8

Updated by mkittler 10 months ago · Edited

  • Status changed from In Progress to Resolved

Now after https://openqa.opensuse.org/minion/jobs?id=3198982 ran for 13 hours and several limit_screenshot jobs ran also for multiple hours (https://openqa.opensuse.org/minion/jobs?task=limit_screenshots) we're back at 50 % which is quite good.

/var/lib/snapshot-changes: Disk space is critically low (used > 94%) now also shows up as resolved in Zabbix.

We have also discussed this in the infra daily. So this ticket can be considered resolved.

Actions #9

Updated by tinita 10 months ago

  • Status changed from Resolved to Feedback

The rollback steps weren't conducted, which is why we did not have any investigate jobs or any other thing done by hook scripts.

I now enabled the hook scripts in openqa.ini, removed the temporary exit in openqa-label-known-issues-and-investigate-hook and restarted gru.

Also there was this open discussion about not being notified early enough, so I would rather like this being kept open and discuss it.

Actions #10

Updated by tinita 10 months ago

Also I set back max_running_jobs to 170. We lowered it to make sure the load is not too high so the cleanup job can finally finish.

Actions #11

Updated by tinita 10 months ago

Ok, i tried to clone the alert but Zabbix is having problems right now, will try again later.

Actions #12

Updated by tinita 10 months ago

  • Subject changed from [alert] o3 zabbix: Problem: /var/lib/snapshot-changes: Disk space is critically low (used > 94%) to [alert] o3 zabbix: Problem: /var/lib/snapshot-changes: Disk space is critically low (used > 94%) size:M
  • Description updated (diff)
Actions #13

Updated by tinita 10 months ago

  • Status changed from Feedback to In Progress

Still trying to figure out the Zabbix configuration. Apparently there is also a "Warning" trigger, but the only difference is that it would warn if there are only 10GB left, and the "Average" trigger (which is apparently higher than Warning) will warn if there are only 5GB left.
On a disk with several TB these hardcoded GB values of course don't make much sense.
Apart from that the condition about when the disk will be full is equal for both.
I increased the values for WARN and CRIT to 85% / 90% now and will edit the timeleft conditions.

Actions #14

Updated by tinita 10 months ago

jbaier_cz wrote in #note-6:

  1. clone the prototype: https://zabbix.suse.de/trigger_prototypes.php?form=update&parent_discoveryid=341978&triggerid=115361&context=host
  2. edit the new clone as desired (the template expression for the triggers which will be created by the discovery)
  3. disable the old prototype (prevent the discovery) and disable the old triggers; new triggers from the clone will eventually appear

I'm not sure if I did the right thing. There is no checkbox for disabling or enabling, just "Create enabled".
I don't know what "Discover" means, but I unchecked it for the old triggers.

Now I guess I have to wait until the new triggers will "eventually appear", but I don't know where they will appear and how long it could take. I mean, they already appear in the prototype list, because I created them.

I attached a screenshot of how it currently looks like.

Actions #15

Updated by tinita 10 months ago

  • Status changed from In Progress to Feedback
Actions #16

Updated by jbaier_cz 10 months ago

tinita wrote in #note-14:

jbaier_cz wrote in #note-6:

  1. clone the prototype: https://zabbix.suse.de/trigger_prototypes.php?form=update&parent_discoveryid=341978&triggerid=115361&context=host
  2. edit the new clone as desired (the template expression for the triggers which will be created by the discovery)
  3. disable the old prototype (prevent the discovery) and disable the old triggers; new triggers from the clone will eventually appear

I'm not sure if I did the right thing. There is no checkbox for disabling or enabling, just "Create enabled".
I don't know what "Discover" means, but I unchecked it for the old triggers.

Yes, disabling discovery is the right choice (discovery is the process of automatic item creation).

Now I guess I have to wait until the new triggers will "eventually appear", but I don't know where they will appear and how long it could take. I mean, they already appear in the prototype list, because I created them.

The main discovery rule (https://zabbix.suse.de/host_discovery.php?form=update&itemid=341978&context=host) has 1h update interval, so the new items/triggers on the host should appear within an hour. The old will disappear within 30 days.

Actions #17

Updated by tinita 10 months ago

Maybe we can look at it together thursday? I don't know where to look for its appearance...

Actions #18

Updated by jbaier_cz 10 months ago

tinita wrote in #note-17:

Maybe we can look at it together thursday? I don't know where to look for its appearance...

Yes, we can if you want. For what I can see, your new triggers are there: https://zabbix.suse.de/triggers.php?filter_set=1&filter_hostids%5B0%5D=10923&context=host

Actions #19

Updated by tinita 10 months ago

  • Status changed from Feedback to Resolved

Ok, we checked it together and the new triggers are here: https://zabbix.suse.de/triggers.php?filter_set=1&filter_hostids%5B0%5D=10923&context=host
and the old triggers are not listed there anymore.

Actions #20

Updated by okurz 4 months ago

  • Related to action #162521: Reconsider the global job limit on o3, try higher than 170 added
Actions

Also available in: Atom PDF