action #104967

closed

File systems alert repeatedly triggering on and off

Added by livdywan about 2 years ago. Updated about 2 years ago.

Status: Resolved
Priority: High
Assignee:
Category: -
Target version:
Start date: 2022-01-17
Due date: 2022-02-02
% Done: 0%
Estimated time:

Description

The alerts seem to be triggering for /srv on and off, and you can see that the percentage is hovering around the 90 percent mark:

/srv: Used Percentage 90.150
/srv: Used Percentage 90.069
/srv: Used Percentage 90.119
/srv: Used Percentage 90.008
/srv: Used Percentage 90.119
/srv: Used Percentage 90.126
/srv: Used Percentage 90.025

Related issues 1 (0 open, 1 closed)

Related to openQA Infrastructure - action #96551: Persistent records of systemd journal size:S (Resolved, okurz, 2021-10-22)

Actions #1

Updated by mkittler about 2 years ago

  • Assignee set to mkittler
Actions #2

Updated by mkittler about 2 years ago

  • Status changed from New to In Progress

/srv always used to be < 60 % and grew only recently (around 14-01-2022): https://stats.openqa-monitor.qa.suse.de/d/WebuiDb/webui-summary?tab=alert&viewPanel=74&orgId=1&from=1642081748259&to=1642328193630

I'll have a look at what's taking up the space.

Actions #3

Updated by mkittler about 2 years ago

Looks like the home dirs are on /srv:

/dev/vdb on /srv type xfs (rw,noatime,nodiratime,attr2,inode64,logbufs=8,logbsize=256k,noquota)
/srv/homes.img on /home type ext4 (rw,relatime)

However, the main culprits are PostgreSQL and logs:

--- /srv
   48,5 GiB [##########] /PSQL
   40,4 GiB [########  ] /log
    1,2 GiB [          ]  homes.img
   30,6 MiB [          ] /salt
   12,8 MiB [          ] /pillar
    4,0 KiB [          ] /reactor
    0,0   B [          ] /www
e   0,0   B [          ] /tftpboot
e   0,0   B [          ] /svn
e   0,0   B [          ] /spm
e   0,0   B [          ] /ftp
e   0,0   B [          ] /backup

I removed the data dir of the previous PostgreSQL version, but that only freed a few MiB because it consisted mostly of hardlinks (with refcount > 1) anyway.
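
To illustrate why deleting hardlinked files frees almost nothing, here is a minimal sketch in Python. The file names are hypothetical and not taken from the actual PostgreSQL data dirs. It only shows the general filesystem behaviour: the data blocks stay allocated as long as at least one link to them remains.

```python
import os
import tempfile

with tempfile.TemporaryDirectory() as tmp:
    # Hypothetical stand-ins for a file in the old and the new data dir.
    old_path = os.path.join(tmp, "old_datadir_file")
    new_path = os.path.join(tmp, "new_datadir_file")

    with open(old_path, "wb") as f:
        f.write(b"x" * 4096)

    # Create a second name for the same inode (refcount becomes 2).
    os.link(old_path, new_path)
    links_before = os.stat(new_path).st_nlink

    # Removing one name only drops the refcount; the data is still there.
    os.remove(old_path)
    links_after = os.stat(new_path).st_nlink
    size_after = os.path.getsize(new_path)

print(links_before, links_after, size_after)
```

Space is only reclaimed once the last link is gone, which matches the observation that removing the old data dir barely changed the usage.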

Now the question is whether PSQL or the logs grew. Regardless of that, having 40,4 GiB of logs seems quite a lot. It is mostly the journal:

--- /srv/log
                         /..
   37,3 GiB [##########] /journal
    1,4 GiB [          ] /apache2
  493,4 MiB [          ]  openqa.1-2021112106.backup
  470,5 MiB [          ]  openqa.1-2021110705.backup
  100,6 MiB [          ]  openqa

Note that the retention isn't actually that long:

martchus@openqa:/srv> sudo journalctl
-- Logs begin at Wed 2022-01-12 03:15:16 CET, end at Mon 2022-01-17 12:41:19 CET. --
Jan 12 03:15:16 openqa systemd[27882]: run-user-17307.mount: Succeeded.
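
As a rough cross-check of the numbers above, the following Python sketch estimates how fast the journal grows. It assumes that the 37,3 GiB from the directory listing and the begin/end timestamps printed by journalctl describe the same set of journal files, which the ticket does not confirm.

```python
from datetime import datetime

# Size of /srv/log/journal from the listing above (decimal comma in the original).
JOURNAL_GIB = 37.3

# "Logs begin at ..." and "end at ..." from the journalctl header.
first_entry = datetime(2022, 1, 12, 3, 15, 16)
last_entry = datetime(2022, 1, 17, 12, 41, 19)

retention_days = (last_entry - first_entry).total_seconds() / 86400
gib_per_day = JOURNAL_GIB / retention_days

print(f"{retention_days:.2f} days, {gib_per_day:.1f} GiB/day")
```

Under that assumption the journal covers roughly 5.4 days and grows by about 6.9 GiB per day.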
Actions #4

Updated by okurz about 2 years ago

  • Related to action #96551: Persistent records of systemd journal size:S added
Actions #5

Updated by okurz about 2 years ago

I changed the log config in #96551 recently to not store the openQA logs in plain files but in the system journal again. Also, I recently merged https://gitlab.suse.de/openqa/salt-states-openqa/-/merge_requests/607 about apache log rotation.

Actions #6

Updated by mkittler about 2 years ago

About the database: It might have grown as well; in fact, #100979 is new/unresolved. However, judging by the figures in #100859, I suppose 48.3 GiB for the database is still in the normal range. So it is most likely the logs that have grown.

> I changed the log config in #96551 recently to not store the openQA logs in plain files but in the system journal again.

I suppose that explains it. Do you want to take care of it in #96551 or should I do something? I'm not sure what, though. I don't want to just revert your effort.

Actions #7

Updated by mkittler about 2 years ago

This PR should help a little bit (with our way too verbose logging): https://github.com/os-autoinst/openQA/pull/4453

I also paused the alert for now.

Actions #8

Updated by mkittler about 2 years ago

The disk usage is not increasing anymore but still hovers around 90 %, which is exactly our alert threshold. So I'll keep the alert paused for now. I suppose we have to limit the log retention again.
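
To show why usage hovering right at the threshold makes the alert flip on and off, here is a small Python sketch. The readings at or above 90 are taken from the description; the readings below 90 and the 85 % clear level are hypothetical and only serve as illustration. The second call uses a separate, lower clear level (hysteresis), which is not what the actual alert does as far as this ticket shows.

```python
THRESHOLD = 90.0  # alert threshold mentioned in this ticket


def count_transitions(samples, fire_at, clear_at):
    """Count state changes of an alert that fires at >= fire_at
    and clears again once the value drops below clear_at."""
    firing = False
    transitions = 0
    for value in samples:
        if not firing and value >= fire_at:
            firing = True
            transitions += 1
        elif firing and value < clear_at:
            firing = False
            transitions += 1
    return transitions


# Values >= 90 come from the description; values < 90 are made up.
samples = [90.150, 89.95, 90.069, 89.98, 90.119, 89.97, 90.008]

plain = count_transitions(samples, THRESHOLD, THRESHOLD)  # same level for fire/clear
banded = count_transitions(samples, THRESHOLD, 85.0)      # hypothetical clear level

print(plain, banded)
```

With a single level every small dip changes the alert state, while a lower clear level keeps it firing steadily. Either way, the real fix is to bring the usage clearly below the threshold.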

Actions #9

Updated by openqa_review about 2 years ago

  • Due date set to 2022-02-02

Setting due date based on mean cycle time of SUSE QE Tools

Actions #10

Updated by okurz about 2 years ago

  • Status changed from In Progress to Feedback

As discussed during the weekly unblock, I created https://gitlab.suse.de/openqa/salt-states-openqa/-/merge_requests/642 to keep enough space free to not hit the alerting threshold. Improvements regarding logging content should go into #96551.

Actions #11

Updated by mkittler about 2 years ago

  • Status changed from Feedback to Resolved

There were some problems, see #96551#note-20 and subsequent comments.

Since SystemKeepFree=20% is not enforced once it has already been exceeded (see https://www.freedesktop.org/software/systemd/man/journald.conf.html#SystemMaxUse=), I invoked sudo journalctl --vacuum-size=26G manually. We're now at 80 % as expected, so I resumed the alert.
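
The following Python sketch is a back-of-the-envelope check of why vacuuming to 26G lands near 80 %. The filesystem size is not stated in this ticket; it is inferred here from the directory sizes in note #3 and an approximate 90.1 % reading, assuming both were taken at about the same time. The journal may also have had a different size by the time of the vacuum, so treat the result as an estimate only.

```python
# Directory sizes under /srv from the listing in note #3, in GiB.
used_gib = 48.5 + 40.4 + 1.2 + (30.6 + 12.8) / 1024

# Approximate alert reading from the description (assumption: same moment).
used_pct = 90.1

# Inferred, not stated in the ticket.
fs_size_gib = used_gib / (used_pct / 100)

journal_before_gib = 37.3  # /srv/log/journal in note #3
journal_after_gib = 26.0   # target of --vacuum-size=26G (suffix is base 1024)

freed_gib = journal_before_gib - journal_after_gib
expected_pct = (used_gib - freed_gib) / fs_size_gib * 100

print(f"filesystem ~{fs_size_gib:.0f} GiB, expected usage ~{expected_pct:.0f} %")
```

Under these assumptions the filesystem is about 100 GiB and the expected usage after the vacuum is just under 80 %, in line with the observed value.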
