action #179629
[alert] Root partition on OSD was almost completely full
Status: open
Description
Observation
The root partition on OSD was almost completely full from 2025-03-28 07:29:19 to 2025-03-28 07:40:50. It went back to normal, but we should investigate what happened. Considering the usage is back to only 60 %, there's probably no need for cleanup right now.
https://monitor.qa.suse.de/d/WebuiDb/webui-summary?orgId=1&from=2025-03-21T19:08:44.678Z&to=2025-03-28T09:23:32.922Z&var-host_disks=$__all&refresh=15m&timezone=UTC&viewPanel=panel-74
shows multiple short-lived spikes. We should check whether users are logging in and copying assets around. If they do, yell at them!
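Since the spikes are too short to catch by hand, a small watch loop could snapshot the largest top-level directories whenever root usage crosses a threshold. This is only a minimal sketch; the 90 % threshold, 30 s interval and the log path /root/root-fs-spike.log are made-up values, not anything we currently run:

#!/bin/bash
# Minimal sketch: record the biggest top-level directories on / whenever
# usage crosses 90 % (threshold, interval and log path are assumptions).
while sleep 30; do
    usage=$(df --output=pcent / | tail -n1 | tr -dc '0-9')
    if [ "${usage:-0}" -ge 90 ]; then
        {
            date --iso-8601=seconds
            du -xsh /* 2>/dev/null | sort -rh | head -n 10
        } >> /root/root-fs-spike.log
    fi
done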
Updated by mkittler 4 days ago
Journal on OSD:
Mar 28 07:30:01 openqa systemd[1]: Started Session c38441 of User geekotest.
Mar 28 07:30:01 openqa systemd[1]: Started Session c38442 of User geekotest.
Mar 28 07:30:01 openqa systemd[1]: Started Session c38443 of User geekotest.
Mar 28 07:30:01 openqa systemd[1]: Started Session c38444 of User geekotest.
Mar 28 07:30:01 openqa cron[17472]: pam_unix(crond:session): session opened for user geekotest by (uid=0)
Mar 28 07:30:01 openqa cron[17475]: pam_unix(crond:session): session opened for user geekotest by (uid=0)
Mar 28 07:30:01 openqa openqa-websockets-daemon[11726]: [debug] [pid:11726] Updating seen of worker 3411 from worker_status (free)
Mar 28 07:30:01 openqa CRON[17513]: (geekotest) CMD ((cd /var/lib/openqa/share/factory/repo/ && for i in fixed/SLE-12-SP?-SDK-POOL-{x86_64,aarch64,ppc64le,s390x}-BuildGM-Media1/ ; do ln -sf $i ; done))
Mar 28 07:30:01 openqa cron[17473]: pam_unix(crond:session): session opened for user geekotest by (uid=0)
Mar 28 07:30:01 openqa CRON[17521]: (geekotest) CMD ((cd /var/lib/openqa/share/factory/repo/ && for i in fixed/SLE-15-*-s390x-GM-*1/ ; do ln -sf $i ; done))
Mar 28 07:30:01 openqa cron[17476]: pam_unix(crond:session): session opened for user geekotest by (uid=0)
Mar 28 07:30:01 openqa CRON[17523]: (geekotest) CMD ((cd /var/lib/openqa/share/factory/repo/ && for i in fixed/SLE-12-{,SP?-}Server-DVD-s390x-GM-DVD1/ ; do ln -sf $i ; done))
Mar 28 07:30:01 openqa CRON[17475]: (geekotest) CMDEND ((cd /var/lib/openqa/share/factory/repo/ && for i in fixed/SLE-12-SP?-SDK-POOL-{x86_64,aarch64,ppc64le,s390x}-BuildGM-Media1/ ; do ln -sf $i ; done))
Mar 28 07:30:01 openqa systemd[1]: session-c38442.scope: Deactivated successfully.
Mar 28 07:30:01 openqa CRON[17475]: pam_unix(crond:session): session closed for user geekotest
Mar 28 07:30:01 openqa CRON[17476]: (geekotest) CMDEND ((cd /var/lib/openqa/share/factory/repo/ && for i in fixed/SLE-12-{,SP?-}Server-DVD-s390x-GM-DVD1/ ; do ln -sf $i ; done))
Mar 28 07:30:01 openqa openqa-webui-daemon[21686]: [debug] [vZ3Nh_FvG0pg] looking for "autoinst-log.txt" in [
Mar 28 07:30:01 openqa openqa-webui-daemon[21686]: "/var/lib/openqa/testresults/17174/17174343-sle-15-SP7-Windows_11_UEFI-x86_64-wsl-main+register:investigate:retry\@win11_uefi",
Mar 28 07:30:01 openqa openqa-webui-daemon[21686]: "/var/lib/openqa/testresults/17174/17174343-sle-15-SP7-Windows_11_UEFI-x86_64-wsl-main+register:investigate:retry\@win11_uefi/ulogs",
Mar 28 07:30:01 openqa openqa-webui-daemon[21686]: ]
Mar 28 07:30:01 openqa openqa-webui-daemon[21686]: [debug] [vZ3Nh_FvG0pg] found bless({
Mar 28 07:30:01 openqa openqa-webui-daemon[21686]: path => "/var/lib/openqa/testresults/17174/17174343-sle-15-SP7-Windows_11_UEFI-x86_64-wsl-main+register:investigate:retry\@win11_uefi/autoinst-log.txt",
Mar 28 07:30:01 openqa openqa-webui-daemon[21686]: pid => 21686,
Mar 28 07:30:01 openqa openqa-webui-daemon[21686]: }, "Mojo::Asset::File")
Mar 28 07:30:01 openqa CRON[17473]: (geekotest) CMDEND ((cd /var/lib/openqa/share/factory/repo/ && for i in fixed/SLE-15-*-s390x-GM-*1/ ; do ln -sf $i ; done))
Mar 28 07:30:01 openqa CRON[17476]: pam_unix(crond:session): session closed for user geekotest
Mar 28 07:30:01 openqa CRON[17473]: pam_unix(crond:session): session closed for user geekotest
Mar 28 07:30:01 openqa systemd[1]: session-c38444.scope: Deactivated successfully.
Mar 28 07:30:01 openqa systemd[1]: session-c38443.scope: Deactivated successfully.
Mar 28 07:30:02 openqa CRON[17472]: pam_unix(crond:session): session closed for user geekotest
Mar 28 07:30:02 openqa systemd[1]: session-c38441.scope: Deactivated successfully.
…
Mar 28 07:30:13 openqa auditd[963]: Audit daemon rotating log files
…
Mar 28 07:35:57 openqa auditd[963]: Audit daemon rotating log files
…
Mar 28 07:38:27 openqa su[13757]: (to postgres) root on none
…
Mar 28 07:38:32 openqa systemd[1]: Started /usr/bin/systemctl disable fstrim.service.
Mar 28 07:38:32 openqa systemd[1]: Reloading requested from client PID 14073 ('systemctl') (unit run-rb7404bc1067547a6b920eb40aabd404c.scope)...
Mar 28 07:38:32 openqa systemd[1]: Reloading...
Mar 28 07:38:32 openqa systemd-fstab-generator[14157]: Checking was requested for "/srv/homes.img", but it is not a device.
…
Mar 28 07:38:37 openqa systemd[1]: Stopping User Manager for UID 26...
Mar 28 07:38:37 openqa systemd[11542]: Activating special unit Exit the Session...
Mar 28 07:38:37 openqa systemd[11542]: Stopped target Main User Target.
Mar 28 07:38:37 openqa systemd[11542]: Stopped target Basic System.
Mar 28 07:38:37 openqa systemd[11542]: Stopped target Paths.
Mar 28 07:38:37 openqa systemd[11542]: Stopped target Sockets.
Mar 28 07:38:37 openqa systemd[11542]: Stopped target Timers.
Mar 28 07:38:37 openqa systemd[11542]: Closed D-Bus User Message Bus Socket.
Mar 28 07:38:37 openqa systemd[11542]: Closed PipeWire Multimedia System Sockets.
Mar 28 07:38:37 openqa systemd[11542]: Removed slice User Application Slice.
Mar 28 07:38:37 openqa systemd[11542]: Reached target Shutdown.
Mar 28 07:38:37 openqa systemd[11542]: Finished Exit the Session.
Mar 28 07:38:37 openqa systemd[11542]: Reached target Exit the Session.
Mar 28 07:38:37 openqa systemd[1]: user@26.service: Deactivated successfully.
Mar 28 07:38:37 openqa systemd[1]: Stopped User Manager for UID 26.
Mar 28 07:38:37 openqa systemd[1]: Stopping User Runtime Directory /run/user/26...
Mar 28 07:38:37 openqa systemd[1]: run-user-26.mount: Deactivated successfully.
Mar 28 07:38:37 openqa systemd[1]: user-runtime-dir@26.service: Deactivated successfully.
…
Mar 28 07:38:38 openqa su[15560]: (to geekotest) root on none
…
Mar 28 07:39:52 openqa auditd[963]: Audit daemon rotating log files
…
Mar 28 07:40:29 openqa auditd[963]: Audit daemon rotating log files
…
Mar 28 07:42:23 openqa auditd[963]: Audit daemon rotating log files
…
Updated by mkittler 4 days ago · Edited
Another spike in this time range: https://monitor.qa.suse.de/d/WebuiDb/webui-summary?orgId=1&viewPanel=panel-74&from=2025-03-28T09%3A37%3A26.702Z&to=2025-03-28T09%3A39%3A47.133Z&timezone=utc&var-host_disks=%24__all
It was only a short spike, so I wasn't quick enough with ncdu.
Note that we have 7.3G headroom (in absolute numbers).
The audit log is less than 50 MiB in total so that's probably not it.
User homes are not on the root partition so that's probably also not it.
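For reference, one way to double-check both of those claims on OSD (the paths are assumed to be the usual defaults):

du -sch /var/log/audit/   # total size of the audit logs (< 50 MiB per the above)
df -h /home               # user homes, confirming they sit on a separate filesystem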
Updated by mkittler 4 days ago · Edited
It has just resolved itself:
martchus@openqa:~> df -h
Filesystem      Size  Used Avail Use% Mounted on
/dev/vda1        20G   11G  7.4G  60% /
The check with ncdu still looks the same, though:
--- / -----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
5.7 GiB [##############################] /usr
4.0 GiB [##################### ] /var
838.0 MiB [#### ] /lib
186.6 MiB [ ] /opt
141.1 MiB [ ] /boot
61.5 MiB [ ] /root
34.9 MiB [ ] /etc
. 28.4 MiB [ ] /tmp
11.1 MiB [ ] /lib64
5.4 MiB [ ] /sbin
860.0 KiB [ ] core
656.0 KiB [ ] /bin
44.0 KiB [ ] /mnt
e 16.0 KiB [ ] /lost+found
8.0 KiB [ ] /.cache
e 4.0 KiB [ ] /t
e 4.0 KiB [ ] /storage
e 4.0 KiB [ ] /selinux
4.0 KiB [ ] .netrwhist
> 0.0 B [ ] /sys
> 0.0 B [ ] /srv
> 0.0 B [ ] /space-slow
> 0.0 B [ ] /run
> 0.0 B [ ] /results
> 0.0 B [ ] /proc
> 0.0 B [ ] /home
0.0 B [ ] forcefsck
> 0.0 B [ ] /dev
> 0.0 B [ ] /assets
0.0 B [ ] 2024-07-18.dump
I also see no difference on nested levels.
Note that I used ncdu -x / to avoid descending into other file systems. The ncdu run mentioned in #179629#note-9 was definitely performed while df showed the following from start to end:
df -h
Filesystem      Size  Used Avail Use% Mounted on
/dev/vda1        20G   18G  727M  97% /
I actually ran ncdu multiple times while df -h was showing that, with consistent results.
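The gap between df (18G used, 97 %) and what ncdu -x can account for (roughly 11 GiB) would be consistent with space held by deleted-but-still-open files, or with data hidden underneath a mount point. A hedged sketch of how one could check for either next time a spike is caught in the act (using /mnt only as an example bind-mount target):

# open files that were deleted but are still held open by some process
lsof -nP +L1

# look underneath the mounts by bind-mounting / elsewhere, so directories
# shadowed by /srv, /assets, /results etc. become visible
mount --bind / /mnt
du -xsh /mnt/* 2>/dev/null | sort -rh | head
umount /mnt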
Updated by openqa_review 3 days ago
- Due date set to 2025-04-12
Setting due date based on mean cycle time of SUSE QE Tools
Updated by livdywan 1 day ago
- Status changed from In Progress to Feedback
- Priority changed from Urgent to High
This wasn't urgent as of late Friday, right?
I suggest we ask for an increase of the root partition from 20G to 40G and see if we still have problems with spikes.
We probably don't want to compromise on audit logs if those consume more space, so we just need an SD ticket here.
Updated by mkittler about 24 hours ago
- Status changed from Feedback to In Progress
We had another spike (https://monitor.qa.suse.de/d/WebuiDb/webui-summary?orgId=1&from=2025-03-29T11%3A27%3A41.155Z&to=2025-03-29T12%3A03%3A57.831Z&timezone=utc&var-host_disks=%24__all&viewPanel=panel-74), so I'm creating an SD ticket.
Updated by mkittler about 23 hours ago
- Status changed from In Progress to Feedback
SD ticket: https://sd.suse.com/servicedesk/customer/portal/1/SD-184246
Once /dev/vda has been increased, we cannot just grow /dev/vda1 without first removing /dev/vda2.
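For the record, a rough sketch of the kind of steps that would follow once SD has grown /dev/vda; it assumes an ext4 root on /dev/vda1 (use the matching resize tool for btrfs/xfs) and that whatever currently lives on /dev/vda2 has been moved away or can be recreated afterwards, so the actual layout needs to be checked first:

lsblk /dev/vda            # check the current layout and what vda2 actually holds
parted -s /dev/vda rm 2   # remove vda2 so vda1 can be extended (only after its data is safe)
growpart /dev/vda 1       # from cloud-utils: grow partition 1 into the freed space
resize2fs /dev/vda1       # grow the filesystem online (ext4 assumed)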