action #174313

closed

[o3][zabbix][alert] / and /var/tmp: "Disk space is low and might be full in 7d (used > 85%)" since 2024-12-11 06:50 size:S

Added by okurz 3 months ago. Updated about 2 months ago.

Status: Resolved
Priority: Normal
Assignee:
Category: Regressions/Crashes
Start date:
Due date:
% Done: 0%
Estimated time:

Description

Observation

From https://zabbix.nue.suse.com/zabbix.php?show=1&name=&inventory%5B0%5D%5Bfield%5D=type&inventory%5B0%5D%5Bvalue%5D=&evaltype=0&tags%5B0%5D%5Btag%5D=&tags%5B0%5D%5Boperator%5D=0&tags%5B0%5D%5Bvalue%5D=&show_tags=3&tag_name_format=0&tag_priority=&show_opdata=0&show_timeline=1&filter_name=&filter_show_counter=0&filter_custom_time=0&sort=clock&sortorder=DESC&age_state=0&show_suppressed=0&unacknowledged=0&compact_view=0&details=0&highlight_row=0&action=problem.view

2024-12-11 06:50:26                                Warning                PROBLEM                ariel.dmz-prg2.suse.org        /var/tmp: Disk space is low and might be full in 7d (used > 85%)        1d 9h 40m        No                Application: Filesystem /var/tmp
2024-12-11 06:50:23                                Warning                PROBLEM                ariel.dmz-prg2.suse.org        /: Disk space is low and might be full in 7d (used > 85%)        1d 9h 40m        No                Application: Filesystem /

Suggestions

  • We're keeping a long list of old packages in /var/cache/zypp/packages/. It goes back to February 2023.
  • Research whether zypper provides options to limit how many cached packages are kept; otherwise add a custom systemd service or extend openqa-auto-update to remove older cached packages based on number and/or age
  • Ensure that this frees up enough space and crosscheck the alert on zabbix again

Related issues 3 (1 open, 2 closed)

Related to openQA Infrastructure (public) - action #40196: [monitoring] monitor internal port 9526, port 80, external port 443 accessibility of o3 and response times size:M (Resolved, okurz, 2018-08-23)
Related to openQA Project (public) - action #176145: Preserve package cache on worker hosts (New, okurz, 2025-01-24)
Copied to openQA Infrastructure (public) - action #174316: [o3][zabbix][alert] no email about zabbix alerts including storage and cpu load size:S (Resolved, jbaier_cz, 2024-12-12)

Actions #1

Updated by okurz 3 months ago

  • Copied to action #174316: [o3][zabbix][alert] no email about zabbix alerts including storage and cpu load size:S added
Actions #2

Updated by okurz 3 months ago

  • Related to action #40196: [monitoring] monitor internal port 9526, port 80, external port 443 accessibility of o3 and response times size:M added
Actions #3

Updated by gpathak 3 months ago

  • Assignee set to gpathak

The /var directory is taking up 11GiB.

gpathak@ariel:~> sudo du -ahcx /var/ | sort -hr | head
11G /var/
11G total
8.5G    /var/cache
8.4G    /var/cache/zypp
8.3G    /var/cache/zypp/packages
7.7G    /var/cache/zypp/packages/devel_openQA
5.3G    /var/cache/zypp/packages/devel_openQA/x86_64
2.4G    /var/cache/zypp/packages/devel_openQA/noarch
1.3G    /var/log
910M    /var/log/journal/06446c641307496183dfdf8dccebdceb
gpathak@ariel:~> 

The /var/log is 1.3GiB

Actions #4

Updated by gpathak 3 months ago

Actions #5

Updated by gpathak 3 months ago

  • Assignee deleted (gpathak)
Actions #6

Updated by tinita 3 months ago

It seems we're keeping a long list of old packages in /var/cache/zypp/packages/. It goes back to February 2023:

ls -lrth /var/cache/zypp/packages/devel_openQA/x86_64/openQA-common-*                                                                                                                             
-rw-r--r-- 1 root root 459K Feb 15  2023 /var/cache/zypp/packages/devel_openQA/x86_64/openQA-common-4.6.1676474487.945e502-lp154.5577.1.x86_64.rpm                                                                

Not sure how to configure this to a lower duration.

Actions #7

Updated by okurz 3 months ago

  • Subject changed from [o3][zabbix][alert] / and /var/tmp: "Disk space is low and might be full in 7d (used > 85%)" since 2024-12-11 06:50 to [o3][zabbix][alert] / and /var/tmp: "Disk space is low and might be full in 7d (used > 85%)" since 2024-12-11 06:50 size:S
  • Description updated (diff)
  • Status changed from New to Workable
Actions #8

Updated by mkittler 2 months ago

  • Status changed from Workable to In Progress
  • Assignee set to mkittler
Actions #9

Updated by mkittler 2 months ago · Edited

We're keeping the packages of the following repos indefinitely:

grep -iR keeppackages=1 /etc/zypp/repos.d
/etc/zypp/repos.d/devel_openQA.repo:keeppackages=1
/etc/zypp/repos.d/devel_openQA_Leap.repo:keeppackages=1

Not sure whether zypper has a way of specifying the number of packages to keep. For now I just used `find /var/cache/zypp/packages -ipath 'devel_openqa' -mtime +365 -delete` to delete everything older than a year.

I can set up a systemd service/timer to invoke a command like that periodically. I could also set keeppackages=0, but we probably enabled this for the sake of easier downgrades, so that is probably not a good solution.

One could also add the following to openqa-auto-update:

if [[ $OPENQA_PACKAGE_CACHE_RETENTION ]]; then
    find /var/cache/zypp/packages -type f -ipath '*devel*openQA*' -mtime "+$OPENQA_PACKAGE_CACHE_RETENTION" -delete
fi

Of course this breaks if one uses a different repository name or a different packagesdir. So it is probably not the best idea to add it to the generic openqa-auto-update script.
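Before letting a command like the one above delete anything, it can help to preview what it would match. The following is only a minimal sketch: the function name preview_cache_cleanup and its two positional parameters are made up for illustration and are not part of openqa-auto-update. It runs the same find expression with -print instead of -delete.

```shell
#!/bin/bash
# Hypothetical helper (not part of openQA): list the cached package files that
# the cleanup would remove, without deleting anything.
#   $1: cache directory (assumed default: /var/cache/zypp/packages)
#   $2: retention in days (assumed default: 365)
preview_cache_cleanup() {
    local cache_dir=${1:-/var/cache/zypp/packages}
    local retention_days=${2:-365}
    # Same expression as the cleanup command, but -print instead of -delete
    find "$cache_dir" -type f -ipath '*devel*openQA*' -mtime "+$retention_days" -print
}
```

Calling preview_cache_cleanup /var/cache/zypp/packages 100 would list the files older than 100 days under repository directories matching the glob; if a different repository name or packagesdir is in use, the list simply comes back empty, which makes the limitation mentioned above easy to spot.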

Actions #10

Updated by openqa_review 2 months ago

  • Due date set to 2025-01-02

Setting due date based on mean cycle time of SUSE QE Tools

Actions #11

Updated by mkittler 2 months ago

  • Status changed from In Progress to Feedback

I added, tested and enabled a simple systemd service/timer on ariel:

martchus@ariel:~> cat /etc/systemd/system/package-cleanup.service 
[Unit]
Description=Cleans up old packages in zypper cache directory

[Service]
Type=oneshot
ExecStart=find /var/cache/zypp/packages -type f -ipath '*devel*openQA*' -mtime +100 -delete

martchus@ariel:~> cat /etc/systemd/system/package-cleanup.timer
[Unit]
Description=Cleans up old packages in zypper cache directory

[Timer]
OnBootSec=15min
OnUnitActiveSec=1w

[Install]
WantedBy=timers.target

This is probably simple enough to not manage this in some repository. (We do have backups of /etc on ariel via the backup VM.)

Actions #12

Updated by okurz 2 months ago

  • Due date changed from 2025-01-02 to 2025-01-24
  • Status changed from Feedback to Workable
  • Priority changed from High to Normal

mkittler wrote in #note-9:

We're keeping the packages of the following repos indefinitely:

grep -iR keeppackages=1 /etc/zypp/repos.d
/etc/zypp/repos.d/devel_openQA.repo:keeppackages=1
/etc/zypp/repos.d/devel_openQA_Leap.repo:keeppackages=1

Not sure whether zypper has a way of specifying the number of packages to keep. For now I just used `find /var/cache/zypp/packages -ipath 'devel_openqa' -mtime +365 -delete` to delete everything older than a year.

I can set up a systemd service/timer to invoke a command like that periodically. I could also set keeppackages=0, but we probably enabled this for the sake of easier downgrades, so that is probably not a good solution.

One could also add the following to openqa-auto-update:

if [[ $OPENQA_PACKAGE_CACHE_RETENTION ]]; then
    find /var/cache/zypp/packages -type f -ipath '*devel*openQA*' -mtime "+$OPENQA_PACKAGE_CACHE_RETENTION" -delete
fi

Of course this breaks if one uses a different repository name or a different packagesdir. So it is probably not the best idea to add it to the generic openqa-auto-update script.

Well, openqa-auto-update is openQA-specific, at least in name, and it also calls https://github.com/os-autoinst/openQA/blob/master/script/openqa-check-devel-repo, which uses devel:openQA, so I guess it's a good idea to cover that in the script. Also, I wouldn't use the mtime, at least not alone: if for whatever reason no upgrade was conducted for 4 months and then a faulty upgrade is installed, all older versions would already have been pruned. How about something like find -mtime +100 | tail -n +$OPENQA_PACKAGE_CACHE_RETENTION_KEEP_MIN to keep at least OPENQA_PACKAGE_CACHE_RETENTION_KEEP_MIN package files? (Careful, that's not that many versions as we have many subpackages.)
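To make the "age plus minimum count" idea concrete, here is a rough sketch under a few assumptions: the function name list_prune_candidates and its parameters are hypothetical, GNU find, sort, tail and awk are available, and file names contain no newlines. It always spares the newest keep_min files and only then applies the age limit. It counts files across all packages, so the caveat about subpackages applies here as well.

```shell
#!/bin/bash
# Hypothetical sketch: print cached *.rpm files that are older than
# max_age_days, but never any of the keep_min most recently modified files.
# Dry run only: candidates are printed, nothing is deleted.
list_prune_candidates() {
    local dir=$1 max_age_days=$2 keep_min=$3
    local cutoff
    # Unix timestamp before which a file counts as "too old"
    cutoff=$(( $(date +%s) - max_age_days * 86400 ))
    # Emit "<mtime> <path>", sort newest first, skip the keep_min newest
    # entries, then print only the paths whose mtime is before the cutoff.
    find "$dir" -type f -name '*.rpm' -printf '%T@ %p\n' \
        | sort -rn \
        | tail -n "+$((keep_min + 1))" \
        | awk -v cutoff="$cutoff" '$1 < cutoff { sub(/^[^ ]+ /, ""); print }'
}
```

The age check uses an exact timestamp comparison, which differs slightly from the whole-day rounding of find -mtime; for a retention of around 100 days that difference should not matter.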

Actions #13

Updated by livdywan about 2 months ago

Let's block on #174316 before trying to adjust the numbers

Actions #14

Updated by livdywan about 2 months ago

  • Status changed from Workable to Blocked
Actions #15

Updated by jbaier_cz about 2 months ago

A side note, /var/tmp looks to be actually the same filesystem as /, it seems that zabbix wrongly detected it twice.
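One way to double-check an observation like this is to compare the backing device that df reports for both paths. This is only a sketch: the helper name same_backing_device is made up, and it assumes GNU df with support for --output.

```shell
#!/bin/bash
# Hypothetical helper: succeed if df reports the same source device for both
# paths, i.e. they are backed by the same filesystem.
same_backing_device() {
    local a b
    # df prints a header line first; the last line holds the source device
    a=$(df --output=source -- "$1" | tail -n 1)
    b=$(df --output=source -- "$2" | tail -n 1)
    # Treat a failed lookup (empty result) as "not the same"
    [[ -n "$a" && "$a" == "$b" ]]
}
```

On a host like the one discussed here, same_backing_device / /var/tmp would be expected to succeed if both mount points share one filesystem.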

Actions #16

Updated by jbaier_cz about 2 months ago · Edited

For reference: this is a bug in the provided systemd unit, see https://github.com/voxpupuli/puppet-zabbix/issues/320 for more context. I adjusted the unit file to fix that issue.

Actions #17

Updated by jbaier_cz about 2 months ago

  • Status changed from Blocked to Workable

Blocker solved

Actions #18

Updated by tinita about 2 months ago

jbaier_cz wrote in #note-16:

and for the reference, it is a bug in the provided systemd unit, see https://github.com/voxpupuli/puppet-zabbix/issues/320 for more context. I adjusted the unit file to fix that issue.

Could you write down here the change you made? I don't really get it.

Actions #19

Updated by mkittler about 2 months ago

@okurz

Well, as openqa-auto-update is openQA-specific, at least in the name, but also because it calls https://github.com/os-autoinst/openQA/blob/master/script/openqa-check-devel-repo which uses devel:openQA …

I looked into what we do so far as well. The problem is not that the new code is specific to openQA and our concrete packaging. The existing code is as well - which makes sense because it is part of the concrete packaging.

The problem is what I mentioned in #174313#note-9:

Of course this breaks if one uses a different repository name or a different packagesdir.

In other words, the new code is specific to the concrete local repository setup and zypper configuration. I considered making it read the relevant bits from the zypper config file but found it too involved.

I'm not sure what your code with tail … would achieve. However, I suppose it would indeed make sense to keep a certain number of copies instead of going by time. Ever since I started using openSUSE, I have found it quite annoying that it lacks a tool like paccache, which I'm used to from Arch Linux (and MSYS2) and which does exactly what you suggested.

Actions #20

Updated by mkittler about 2 months ago · Edited

@okurz What about something like this?

openqa-clean-devel-repo-cache:

#!/bin/bash
set -e

# Configuration with defaults; each value can be overridden via the environment
OPENQA_PACKAGE_CACHE_RETENTION=${OPENQA_PACKAGE_CACHE_RETENTION:-100}
OPENQA_PACKAGE_CACHE_RETENTION_KEEP_MIN=${OPENQA_PACKAGE_CACHE_RETENTION_KEEP_MIN:-3}
OPENQA_PACKAGE_CACHE_PATH=${OPENQA_PACKAGE_CACHE_PATH:-/hdd/cache/zypp/packages}
OPENQA_PACKAGE_CACHE_REPO_GLOB=${OPENQA_PACKAGE_CACHE_REPO_GLOB:-'*devel*openQA*'}

# Collect package files older than the retention period, newest version first
IFS=$'\n'
package_files=($(find "$OPENQA_PACKAGE_CACHE_PATH" -type f -ipath "$OPENQA_PACKAGE_CACHE_REPO_GLOB" -mtime "+$OPENQA_PACKAGE_CACHE_RETENTION" | sort -rV))

previous_package_name=
package_count=0

for package_file in "${package_files[@]}"; do
    # Files with the same package name are different versions of one package
    package_name=$(rpm -q --qf "%{NAME}\n" "$package_file")
    if [[ $package_name != "$previous_package_name" ]]; then
        # First (newest) file of the next package: restart counting
        previous_package_name=$package_name
        package_count=0
    fi
    package_count=$((package_count + 1))
    # Dry run: only print what would be removed or kept
    if [[ $package_count -gt "$OPENQA_PACKAGE_CACHE_RETENTION_KEEP_MIN" ]]; then
        echo "rm  " "$package_file"
    else
        echo "keep" "$package_file"
    fi
done

One could also remove the mtime parameter completely. The use of sort -rV should make sure that the newest packages survive. The use of rpm -q --qf "%{NAME}\n" "$package_file" helps to decide which package files are actually the same package (but just different versions).
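As a small illustration of why sort -rV matters here, the sketch below sorts a few file names newest-first. The file names are made up and newest_first is a hypothetical helper; the point is only that version sort compares 4.6.10 and 4.6.9 numerically, whereas a plain lexical sort would rank 4.6.9 ahead of 4.6.10.

```shell
#!/bin/bash
# Hypothetical helper: print the given file names, newest version first,
# using GNU sort's version ordering in reverse.
newest_first() {
    printf '%s\n' "$@" | sort -rV
}

# Made-up file names for illustration
newest_first openQA-4.6.9-1.rpm openQA-4.6.10-1.rpm openQA-4.6.2-1.rpm
# prints:
#   openQA-4.6.10-1.rpm
#   openQA-4.6.9-1.rpm
#   openQA-4.6.2-1.rpm
```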

This script produces sane output on my local system (also when removing the mtime parameter).

We still have to decide where to put this script. I suppose we could add it to openqa-auto-update with all the specifics put into variables. I would make it so it doesn't run by default. We could however specify common defaults for OPENQA_PACKAGE_CACHE_PATH and OPENQA_PACKAGE_CACHE_RETENTION_KEEP_MIN.

Actions #21

Updated by okurz about 2 months ago

mkittler wrote in #note-20:

@okurz What about something like this?

LGTM

One could also remove the mtime parameter completely.

I would keep the mtime as a safety measure.

We still have to decide where to put this script. I suppose we could add it to openqa-auto-update with all the specifics put into variables. I would make it so it doesn't run by default. We could however specify common defaults for OPENQA_PACKAGE_CACHE_PATH and OPENQA_PACKAGE_CACHE_RETENTION_KEEP_MIN.

yes, all that sounds good.

Actions #22

Updated by jbaier_cz about 2 months ago

tinita wrote in #note-18:

jbaier_cz wrote in #note-16:

and for the reference, it is a bug in the provided systemd unit, see https://github.com/voxpupuli/puppet-zabbix/issues/320 for more context. I adjusted the unit file to fix that issue.

Could you write down here the change you made? I don't really get it.

Sure, see systemctl cat zabbix_agentd.service. I just added the following snippet, as recommended in the linked issue:

# /etc/systemd/system/zabbix_agentd.service.d/override.conf
[Service]
PrivateTmp=no
Actions #23

Updated by mkittler about 2 months ago

  • Status changed from Workable to Feedback

PR: https://github.com/os-autoinst/openQA/pull/6104

I have also disabled the timer I previously configured again.

Actions #24

Updated by mkittler about 2 months ago

  • Status changed from Feedback to Resolved

The PR has been merged and deployed on o3. I also enabled the cleanup there. Currently there's not much to see because there's nothing to be cleaned up. That is expected because the find -mtime +100 … service/timer was still enabled before and we also keep up to 10 versions of each package. I did a dry run of the script with different parameters to see how it behaves on ariel and it seems to work.

Actions #25

Updated by okurz about 2 months ago

  • Due date deleted (2025-01-24)
  • Status changed from Resolved to Workable

reopening, see #175464-12

Actions #26

Updated by mkittler about 2 months ago

  • Status changed from Workable to Rejected

There are plenty of cached packages on ariel, e.g. the command you mentioned (find /var/cache/zypp/ | less) returns many results. If this is about workers (where the cache is indeed empty) then this is completely unrelated because my change to auto-update is not enabled by default and was only enabled on ariel. (If someone enabled it meanwhile elsewhere that is not a reason to reopen this ticket.)

Actions #27

Updated by mkittler about 2 months ago

I also just had a look at one of the repo config files on openqaworker23 (/etc/zypp/repos.d/devel_openQA.repo) and I don't see that keeppackages=1 is configured. So a clean cache directory is presumably expected on that machine (and probably others).

Actions #28

Updated by okurz about 2 months ago

  • Status changed from Rejected to Resolved

Alright, seems like we never had it on workers then

Actions #29

Updated by livdywan about 1 month ago

  • Related to action #176145: Preserve package cache on worker hosts added