action #104967
closed
File systems alert repeatedly triggering on and off
Added by livdywan over 2 years ago. Updated over 2 years ago.
0%
Description
The alerts seem to be triggering for /srv on and off, and you can see that the percentage is hovering around the 90 percent mark:
/srv: Used Percentage 90.150
/srv: Used Percentage 90.069
/srv: Used Percentage 90.119
/srv: Used Percentage 90.008
/srv: Used Percentage 90.119
/srv: Used Percentage 90.126
/srv: Used Percentage 90.025
Updated by mkittler over 2 years ago
- Status changed from New to In Progress
/srv always used to be < 60 % and grew only recently (around 14-01-2022): https://stats.openqa-monitor.qa.suse.de/d/WebuiDb/webui-summary?tab=alert&viewPanel=74&orgId=1&from=1642081748259&to=1642328193630
I'll have a look at what's taking the space.
Updated by mkittler over 2 years ago
Looks like the home dirs are on /srv:
/dev/vdb on /srv type xfs (rw,noatime,nodiratime,attr2,inode64,logbufs=8,logbsize=256k,noquota)
/srv/homes.img on /home type ext4 (rw,relatime)
However, the main culprits are PostgreSQL and logs:
--- /srv
48,5 GiB [##########] /PSQL
40,4 GiB [######## ] /log
1,2 GiB [ ] homes.img
30,6 MiB [ ] /salt
12,8 MiB [ ] /pillar
4,0 KiB [ ] /reactor
0,0 B [ ] /www
e 0,0 B [ ] /tftpboot
e 0,0 B [ ] /svn
e 0,0 B [ ] /spm
e 0,0 B [ ] /ftp
e 0,0 B [ ] /backup
I've removed the data dir of the previous PostgreSQL version, but that only freed a few MiB because it consisted mostly of hardlinks (with refcount > 1) anyway.
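To illustrate why removing a hardlinked directory frees so little: data blocks are only released once the last link to an inode is gone. The following is a minimal Python sketch, not the command that was actually run here; the function name `reclaimable_bytes` and the demo layout are hypothetical, and it uses file sizes (`st_size`) as an approximation of allocated blocks.

```python
import os
import stat
import tempfile


def reclaimable_bytes(root):
    """Estimate the bytes freed by deleting every regular file under root.

    A file only counts if all of its hard links live inside root; otherwise
    deleting root merely drops some names and the data stays allocated.
    """
    inodes = {}  # (device, inode) -> [links seen under root, total links, size]
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            st = os.lstat(os.path.join(dirpath, name))
            if not stat.S_ISREG(st.st_mode):
                continue
            entry = inodes.setdefault(
                (st.st_dev, st.st_ino), [0, st.st_nlink, st.st_size]
            )
            entry[0] += 1
    return sum(size for in_tree, nlink, size in inodes.values() if in_tree == nlink)


def demo():
    """Build a tiny 'old' and 'new' data dir sharing one file via a hard link."""
    with tempfile.TemporaryDirectory() as tmp:
        old = os.path.join(tmp, "old")
        new = os.path.join(tmp, "new")
        os.mkdir(old)
        os.mkdir(new)
        shared = os.path.join(old, "shared")
        with open(shared, "wb") as f:
            f.write(b"x" * 1000)
        os.link(shared, os.path.join(new, "shared"))  # refcount becomes 2
        with open(os.path.join(old, "only_old"), "wb") as f:
            f.write(b"y" * 500)
        return reclaimable_bytes(old)


if __name__ == "__main__":
    print(demo())
```

In the demo, only the 500-byte file that exists solely in the old directory counts as reclaimable; the 1000-byte file is still referenced from the new directory.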
Now the question is whether PSQL or logs grew. Regardless of that, having 40,4 GiB of logs seems quite a lot. It is mostly the journal:
--- /srv/log
/..
37,3 GiB [##########] /journal
1,4 GiB [ ] /apache2
493,4 MiB [ ] openqa.1-2021112106.backup
470,5 MiB [ ] openqa.1-2021110705.backup
100,6 MiB [ ] openqa
Note that the retention isn't actually that long:
martchus@openqa:/srv> sudo journalctl
-- Logs begin at Wed 2022-01-12 03:15:16 CET, end at Mon 2022-01-17 12:41:19 CET. --
Jan 12 03:15:16 openqa systemd[27882]: run-user-17307.mount: Succeeded.
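For a rough idea of how fast the journal grows, the size and the time range above can be combined. This is only a back-of-the-envelope Python sketch: it assumes that the 37,3 GiB in /srv/log/journal cover exactly the range reported by journalctl and that growth is uniform, neither of which has been verified. The function name is made up for illustration.

```python
from datetime import datetime

# Figures taken from the listings above; both timestamps are in CET.
JOURNAL_SIZE_GIB = 37.3
FIRST_ENTRY = datetime(2022, 1, 12, 3, 15, 16)
LAST_ENTRY = datetime(2022, 1, 17, 12, 41, 19)


def growth_per_day(size_gib, start, end):
    """Average journal growth in GiB per day over the retained time range."""
    days = (end - start).total_seconds() / 86400
    return size_gib / days


if __name__ == "__main__":
    days = (LAST_ENTRY - FIRST_ENTRY).total_seconds() / 86400
    rate = growth_per_day(JOURNAL_SIZE_GIB, FIRST_ENTRY, LAST_ENTRY)
    print(f"retention: {days:.1f} days, about {rate:.1f} GiB/day")
```

Under these assumptions the journal holds roughly 5.4 days of logs and grows by about 6.9 GiB per day.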
Updated by okurz over 2 years ago
- Related to action #96551: Persistent records of systemd journal size:S added
Updated by okurz over 2 years ago
I changed the log config in #96551 recently to not store the openQA logs in plain files but in system journal again. Also, I recently merged https://gitlab.suse.de/openqa/salt-states-openqa/-/merge_requests/607 about Apache log rotation.
Updated by mkittler over 2 years ago
About the database: It might have grown as well; in fact, #100979 is new and still unresolved. However, judging by the figures in #100859 I suppose 48.3 GiB for the database is still in the normal range. So it is most likely the logs that have grown.
> I changed the log config in #96551 recently to not store the openQA logs in plain files but in system journal again.
I suppose that explains it. Do you want to take care of it in #96551 or should I do something? I'm not sure what, though. I don't want to just revert your effort.
Updated by mkittler over 2 years ago
This PR should help a little bit (with our way too verbose logging): https://github.com/os-autoinst/openQA/pull/4453
I also paused the alert for now.
Updated by mkittler over 2 years ago
The disk usage is not increasing anymore but still hovers around 90 %, which is exactly our alert threshold. So I'll keep the alert paused for now. I suppose we have to limit the log retention again.
Updated by openqa_review over 2 years ago
- Due date set to 2022-02-02
Setting due date based on mean cycle time of SUSE QE Tools
Updated by okurz over 2 years ago
- Status changed from In Progress to Feedback
As discussed during the weekly unblock, I created https://gitlab.suse.de/openqa/salt-states-openqa/-/merge_requests/642 to keep enough space free to not hit the alerting threshold. Improvements regarding logging content should go into #96551.
Updated by mkittler over 2 years ago
- Status changed from Feedback to Resolved
There were some problems, see #96551#note-20 and subsequent comments.
Since SystemKeepFree=20% is not enforced after being already exceeded (see https://www.freedesktop.org/software/systemd/man/journald.conf.html#SystemMaxUse=), I've invoked sudo journalctl --vacuum-size=26G manually. We're now at 80 % as expected so I resumed the alert.
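For reference, a vacuum size like this can be estimated from the file system figures. The Python sketch below is not how the 26G value was derived; the function name `journal_budget` is hypothetical and the numbers in the example are round, illustrative values rather than measurements from this host. It assumes the journal is the only thing that gets trimmed.

```python
GIB = 1024 ** 3


def journal_budget(total, used, journal, target_used_pct):
    """Return how many bytes the journal may keep so that the file system
    ends up at or below target_used_pct percent used.

    All sizes are in bytes. The result is clamped to 0 when trimming the
    journal alone cannot reach the target.
    """
    allowed_used = total * target_used_pct // 100
    excess = max(0, used - allowed_used)
    return max(0, journal - excess)


if __name__ == "__main__":
    # Illustrative numbers only: a 100 GiB file system that is 90 % full,
    # with a 37 GiB journal and a target of 80 % usage.
    keep = journal_budget(100 * GIB, 90 * GIB, 37 * GIB, 80)
    # Only build the command string here; nothing is executed.
    print(f"sudo journalctl --vacuum-size={keep // GIB}G")
```

With these example figures the journal would have to shrink to 27 GiB. Picking a somewhat smaller value leaves headroom for the rest of the file system to grow.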