action #179629

open

[alert] Root partition on OSD was almost completely full

Added by mkittler 4 days ago. Updated about 23 hours ago.

Status: Feedback
Priority: High
Assignee:
Category: Regressions/Crashes
Start date: 2025-03-28
Due date: 2025-04-12 (Due in 11 days)
% Done: 0%
Estimated time:
Description

Observation

The root partition on OSD was almost completely full from 2025-03-28 07:29:19 to 2025-03-28 07:40:50, then went back to normal. We should investigate what happened. Considering the usage is back to only 60 %, there's probably no need for cleanup right now.

https://monitor.qa.suse.de/d/WebuiDb/webui-summary?orgId=1&from=2025-03-21T19:08:44.678Z&to=2025-03-28T09:23:32.922Z&var-host_disks=$__all&refresh=15m&timezone=UTC&viewPanel=panel-74
shows multiple short spikes. We should check whether users log in and copy assets around. If they do, yell at them!

Actions #1

Updated by mkittler 4 days ago

Journal on OSD:

Mar 28 07:30:01 openqa systemd[1]: Started Session c38441 of User geekotest.
Mar 28 07:30:01 openqa systemd[1]: Started Session c38442 of User geekotest.
Mar 28 07:30:01 openqa systemd[1]: Started Session c38443 of User geekotest.
Mar 28 07:30:01 openqa systemd[1]: Started Session c38444 of User geekotest.
Mar 28 07:30:01 openqa cron[17472]: pam_unix(crond:session): session opened for user geekotest by (uid=0)
Mar 28 07:30:01 openqa cron[17475]: pam_unix(crond:session): session opened for user geekotest by (uid=0)
Mar 28 07:30:01 openqa openqa-websockets-daemon[11726]: [debug] [pid:11726] Updating seen of worker 3411 from worker_status (free)
Mar 28 07:30:01 openqa CRON[17513]: (geekotest) CMD ((cd /var/lib/openqa/share/factory/repo/ && for i in fixed/SLE-12-SP?-SDK-POOL-{x86_64,aarch64,ppc64le,s390x}-BuildGM-Media1/ ; do ln -sf $i ; done))
Mar 28 07:30:01 openqa cron[17473]: pam_unix(crond:session): session opened for user geekotest by (uid=0)
Mar 28 07:30:01 openqa CRON[17521]: (geekotest) CMD ((cd /var/lib/openqa/share/factory/repo/ && for i in fixed/SLE-15-*-s390x-GM-*1/ ; do ln -sf $i ; done))
Mar 28 07:30:01 openqa cron[17476]: pam_unix(crond:session): session opened for user geekotest by (uid=0)
Mar 28 07:30:01 openqa CRON[17523]: (geekotest) CMD ((cd /var/lib/openqa/share/factory/repo/ && for i in fixed/SLE-12-{,SP?-}Server-DVD-s390x-GM-DVD1/ ; do ln -sf $i ; done))
Mar 28 07:30:01 openqa CRON[17475]: (geekotest) CMDEND ((cd /var/lib/openqa/share/factory/repo/ && for i in fixed/SLE-12-SP?-SDK-POOL-{x86_64,aarch64,ppc64le,s390x}-BuildGM-Media1/ ; do ln -sf $i ; done))
Mar 28 07:30:01 openqa systemd[1]: session-c38442.scope: Deactivated successfully.
Mar 28 07:30:01 openqa CRON[17475]: pam_unix(crond:session): session closed for user geekotest
Mar 28 07:30:01 openqa CRON[17476]: (geekotest) CMDEND ((cd /var/lib/openqa/share/factory/repo/ && for i in fixed/SLE-12-{,SP?-}Server-DVD-s390x-GM-DVD1/ ; do ln -sf $i ; done))
Mar 28 07:30:01 openqa openqa-webui-daemon[21686]: [debug] [vZ3Nh_FvG0pg] looking for "autoinst-log.txt" in [
Mar 28 07:30:01 openqa openqa-webui-daemon[21686]:   "/var/lib/openqa/testresults/17174/17174343-sle-15-SP7-Windows_11_UEFI-x86_64-wsl-main+register:investigate:retry\@win11_uefi",
Mar 28 07:30:01 openqa openqa-webui-daemon[21686]:   "/var/lib/openqa/testresults/17174/17174343-sle-15-SP7-Windows_11_UEFI-x86_64-wsl-main+register:investigate:retry\@win11_uefi/ulogs",
Mar 28 07:30:01 openqa openqa-webui-daemon[21686]: ]
Mar 28 07:30:01 openqa openqa-webui-daemon[21686]: [debug] [vZ3Nh_FvG0pg] found bless({
Mar 28 07:30:01 openqa openqa-webui-daemon[21686]:   path => "/var/lib/openqa/testresults/17174/17174343-sle-15-SP7-Windows_11_UEFI-x86_64-wsl-main+register:investigate:retry\@win11_uefi/autoinst-log.txt",
Mar 28 07:30:01 openqa openqa-webui-daemon[21686]:   pid  => 21686,
Mar 28 07:30:01 openqa openqa-webui-daemon[21686]: }, "Mojo::Asset::File")
Mar 28 07:30:01 openqa CRON[17473]: (geekotest) CMDEND ((cd /var/lib/openqa/share/factory/repo/ && for i in fixed/SLE-15-*-s390x-GM-*1/ ; do ln -sf $i ; done))
Mar 28 07:30:01 openqa CRON[17476]: pam_unix(crond:session): session closed for user geekotest
Mar 28 07:30:01 openqa CRON[17473]: pam_unix(crond:session): session closed for user geekotest
Mar 28 07:30:01 openqa systemd[1]: session-c38444.scope: Deactivated successfully.
Mar 28 07:30:01 openqa systemd[1]: session-c38443.scope: Deactivated successfully.
Mar 28 07:30:02 openqa CRON[17472]: pam_unix(crond:session): session closed for user geekotest
Mar 28 07:30:02 openqa systemd[1]: session-c38441.scope: Deactivated successfully.
…
Mar 28 07:30:13 openqa auditd[963]: Audit daemon rotating log files
…
Mar 28 07:35:57 openqa auditd[963]: Audit daemon rotating log files
…
Mar 28 07:38:27 openqa su[13757]: (to postgres) root on none
…
Mar 28 07:38:32 openqa systemd[1]: Started /usr/bin/systemctl disable fstrim.service.
Mar 28 07:38:32 openqa systemd[1]: Reloading requested from client PID 14073 ('systemctl') (unit run-rb7404bc1067547a6b920eb40aabd404c.scope)...
Mar 28 07:38:32 openqa systemd[1]: Reloading...
Mar 28 07:38:32 openqa systemd-fstab-generator[14157]: Checking was requested for "/srv/homes.img", but it is not a device.
…
Mar 28 07:38:37 openqa systemd[1]: Stopping User Manager for UID 26...
Mar 28 07:38:37 openqa systemd[11542]: Activating special unit Exit the Session...
Mar 28 07:38:37 openqa systemd[11542]: Stopped target Main User Target.
Mar 28 07:38:37 openqa systemd[11542]: Stopped target Basic System.
Mar 28 07:38:37 openqa systemd[11542]: Stopped target Paths.
Mar 28 07:38:37 openqa systemd[11542]: Stopped target Sockets.
Mar 28 07:38:37 openqa systemd[11542]: Stopped target Timers.
Mar 28 07:38:37 openqa systemd[11542]: Closed D-Bus User Message Bus Socket.
Mar 28 07:38:37 openqa systemd[11542]: Closed PipeWire Multimedia System Sockets.
Mar 28 07:38:37 openqa systemd[11542]: Removed slice User Application Slice.
Mar 28 07:38:37 openqa systemd[11542]: Reached target Shutdown.
Mar 28 07:38:37 openqa systemd[11542]: Finished Exit the Session.
Mar 28 07:38:37 openqa systemd[11542]: Reached target Exit the Session.
Mar 28 07:38:37 openqa systemd[1]: user@26.service: Deactivated successfully.
Mar 28 07:38:37 openqa systemd[1]: Stopped User Manager for UID 26.
Mar 28 07:38:37 openqa systemd[1]: Stopping User Runtime Directory /run/user/26...
Mar 28 07:38:37 openqa systemd[1]: run-user-26.mount: Deactivated successfully.
Mar 28 07:38:37 openqa systemd[1]: user-runtime-dir@26.service: Deactivated successfully.
…
Mar 28 07:38:38 openqa su[15560]: (to geekotest) root on none
…
Mar 28 07:39:52 openqa auditd[963]: Audit daemon rotating log files
…
Mar 28 07:40:29 openqa auditd[963]: Audit daemon rotating log files
…
Mar 28 07:42:23 openqa auditd[963]: Audit daemon rotating log files
…
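
To check the "do users log in around the spike" question against a journal excerpt like the one above, a small parser can count session starts per user. This is only a sketch: the regular expression is an assumption derived from the `Started Session … of User …` lines shown here, and `count_sessions_by_user` is a made-up helper name, not part of any openQA tooling.

```python
import re
from collections import Counter

# Matches lines such as:
#   Mar 28 07:30:01 openqa systemd[1]: Started Session c38441 of User geekotest.
SESSION_RE = re.compile(
    r"^\w{3} +\d+ \d{2}:\d{2}:\d{2} \S+ systemd\[1\]: "
    r"Started Session (?P<session>\S+) of User (?P<user>\S+?)\.$"
)


def count_sessions_by_user(lines):
    """Count 'Started Session' journal entries per user name."""
    counts = Counter()
    for line in lines:
        match = SESSION_RE.match(line.strip())
        if match:
            counts[match.group("user")] += 1
    return counts


if __name__ == "__main__":
    # Two lines copied from the journal excerpt in this comment.
    sample = [
        "Mar 28 07:30:01 openqa systemd[1]: Started Session c38441 of User geekotest.",
        "Mar 28 07:30:13 openqa auditd[963]: Audit daemon rotating log files",
    ]
    for user, count in count_sessions_by_user(sample).items():
        print(user, count)
```

Feeding it the full journal for the spike window would show whether anyone other than geekotest (the cron jobs) opened sessions at that time.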
Actions #2

Updated by okurz 4 days ago

  • Tags changed from alert, infra to alert, infra, reactive work
  • Category set to Regressions/Crashes
  • Priority changed from Normal to Urgent
  • Target version set to Ready
Actions #3

Updated by okurz 4 days ago

  • Description updated (diff)
Actions #5

Updated by mkittler 4 days ago · Edited

Another spike in this time range: https://monitor.qa.suse.de/d/WebuiDb/webui-summary?orgId=1&viewPanel=panel-74&from=2025-03-28T09%3A37%3A26.702Z&to=2025-03-28T09%3A39%3A47.133Z&timezone=utc&var-host_disks=%24__all

It was only a short spike so I wasn't quick enough with ncdu.

Note that we have 7.3G headroom (in absolute numbers).

The audit log is less than 50 MiB in total so that's probably not it.

User homes are not on the root partition so that's probably also not it.
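
Since the spike was too short to catch by hand, one option would be a watcher that polls the usage and dumps a `du` listing as soon as a threshold is crossed. The following is a rough sketch, not something deployed on OSD: the threshold, polling interval and output prefix are assumptions, and the output should go to a path that is not on the root partition.

```python
import shutil
import subprocess
import time


def used_percent(path="/"):
    """Used space as a percentage of the total size of the file system.

    Note: df computes Use% relative to used + available space, so its
    number can be slightly higher than this one.
    """
    usage = shutil.disk_usage(path)
    return 100.0 * usage.used / usage.total


def snapshot(path="/", depth=2):
    """Return `du` output for `path`; -x stays on one file system (like `ncdu -x /`)."""
    result = subprocess.run(
        ["du", "-x", "-k", "-d", str(depth), path],
        capture_output=True, text=True, check=False,
    )
    return result.stdout


def watch(path="/", threshold=90.0, interval=5.0, out_prefix="/var/tmp/root-spike"):
    """Poll forever and write a snapshot whenever usage reaches the threshold.

    out_prefix is a placeholder; pick a location on another file system.
    """
    while True:
        pct = used_percent(path)
        if pct >= threshold:
            stamp = time.strftime("%Y%m%dT%H%M%S")
            with open(f"{out_prefix}-{stamp}.txt", "w") as fh:
                fh.write(f"used: {pct:.1f}%\n")
                fh.write(snapshot(path))
        time.sleep(interval)
```

Calling `watch()` as root in a screen session or a transient systemd unit would be enough for a one-off investigation.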

Actions #7

Updated by mkittler 4 days ago

  • Private changed from Yes to No
Actions #10

Updated by mkittler 4 days ago · Edited

It has just resolved itself:

martchus@openqa:~> df -h
Filesystem      Size  Used Avail Use% Mounted on
/dev/vda1        20G   11G  7.4G  60% /

The check with ncdu still looks the same, though:

--- / -----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
    5.7 GiB [##############################] /usr
    4.0 GiB [#####################         ] /var
  838.0 MiB [####                          ] /lib
  186.6 MiB [                              ] /opt
  141.1 MiB [                              ] /boot
   61.5 MiB [                              ] /root
   34.9 MiB [                              ] /etc
.  28.4 MiB [                              ] /tmp
   11.1 MiB [                              ] /lib64
    5.4 MiB [                              ] /sbin
  860.0 KiB [                              ]  core
  656.0 KiB [                              ] /bin
   44.0 KiB [                              ] /mnt
e  16.0 KiB [                              ] /lost+found
    8.0 KiB [                              ] /.cache
e   4.0 KiB [                              ] /t
e   4.0 KiB [                              ] /storage
e   4.0 KiB [                              ] /selinux
    4.0 KiB [                              ]  .netrwhist
>   0.0   B [                              ] /sys
>   0.0   B [                              ] /srv
>   0.0   B [                              ] /space-slow
>   0.0   B [                              ] /run
>   0.0   B [                              ] /results
>   0.0   B [                              ] /proc
>   0.0   B [                              ] /home
    0.0   B [                              ]  forcefsck
>   0.0   B [                              ] /dev
>   0.0   B [                              ] /assets
    0.0   B [                              ]  2024-07-18.dump

I also see no difference on nested levels.

Note that I used ncdu -x / to avoid checking across file systems. The ncdu run mentioned on #179629#note-9 was definitely performed while df showed the following from start to end:

df -h
Filesystem      Size  Used Avail Use% Mounted on
/dev/vda1        20G   18G  727M  97% /

I actually ran ncdu multiple times while df -h was showing that, always with consistent results.
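
A common reason for df reporting far more usage than ncdu can find is space held by files that were deleted while a process still has them open. Nothing in this ticket confirms that this is what happened here, so treat the following as a sketch for checking that hypothesis during the next spike. It relies on Linux's `/proc/<pid>/fd` links, whose target ends in ` (deleted)` for unlinked files, and needs root to see all processes. The function names are made up for this example.

```python
import os

DELETED_SUFFIX = " (deleted)"


def is_deleted_target(target):
    """True if a /proc/<pid>/fd link target refers to an unlinked file."""
    return target.endswith(DELETED_SUFFIX)


def deleted_open_files(mount_point="/", proc_root="/proc"):
    """List (allocated_bytes, pid, target) for deleted-but-open files.

    Only files on the same device as `mount_point` are reported.
    """
    if not os.path.isdir(proc_root):
        return []
    root_dev = os.stat(mount_point).st_dev
    seen = {}
    for pid in os.listdir(proc_root):
        if not pid.isdigit():
            continue
        fd_dir = os.path.join(proc_root, pid, "fd")
        try:
            fds = os.listdir(fd_dir)
        except OSError:
            continue  # process exited or permission denied
        for fd in fds:
            fd_path = os.path.join(fd_dir, fd)
            try:
                target = os.readlink(fd_path)
                if not is_deleted_target(target):
                    continue
                st = os.stat(fd_path)  # follows the link to the open file
            except OSError:
                continue
            if st.st_dev != root_dev:
                continue
            # Key by inode so a file opened several times is counted once.
            seen[(st.st_dev, st.st_ino)] = (st.st_blocks * 512, pid, target)
    return sorted(seen.values(), reverse=True)


if __name__ == "__main__":
    for size, pid, target in deleted_open_files()[:10]:
        print(f"{size / 2**20:10.1f} MiB  pid {pid}  {target}")
```

If the list stays empty while df shows 97 %, other explanations (for example data hidden underneath a mount point) would be worth checking next.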

Actions #12

Updated by openqa_review 3 days ago

  • Due date set to 2025-04-12

Setting due date based on mean cycle time of SUSE QE Tools

Actions #13

Updated by livdywan 1 day ago

  • Status changed from In Progress to Feedback
  • Priority changed from Urgent to High

This wasn't urgent as of late Friday, right?

I suggest we ask for an increase of the root partition from 20G to 40G and see if we still have problems with spikes.

We probably don't want to compromise on audit logs if those consume more space. So we just need an SD ticket here.

Actions #15

Updated by mkittler about 23 hours ago

  • Status changed from In Progress to Feedback

SD ticket: https://sd.suse.com/servicedesk/customer/portal/1/SD-184246

Once /dev/vda has been increased, we cannot just increase /dev/vda1 without first removing /dev/vda2.
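
To illustrate why /dev/vda2 is in the way: a partition can only grow in place into free sectors that directly follow it. The sketch below uses an invented layout (sector numbers, sizes and the purpose of vda2 are assumptions, not taken from OSD) and ignores details such as the backup GPT header at the end of the disk.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Partition:
    name: str
    start: int  # first sector
    end: int    # last sector (inclusive)


def growable_sectors(target, partitions, disk_sectors):
    """Number of sectors by which `target` could grow without moving anything."""
    later_starts = [p.start for p in partitions if p.start > target.end]
    limit = min(later_starts) if later_starts else disk_sectors
    return limit - target.end - 1


# Hypothetical layout with 512-byte sectors: a 20 GiB vda1 directly
# followed by a 2 GiB vda2, on a disk that was enlarged by 20 GiB.
vda1 = Partition("vda1", start=2048, end=41945087)
vda2 = Partition("vda2", start=41945088, end=46139391)
disk_sectors = 88082432

print(growable_sectors(vda1, [vda1, vda2], disk_sectors))  # blocked by vda2
print(growable_sectors(vda1, [vda1], disk_sectors))        # after removing vda2
```

In this made-up layout all new space sits behind vda2, so vda1 can only use it once vda2 has been removed (or moved to the end of the disk).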
