action #115208
closed
failed-systemd-services: logrotate-openqa alerting on and off size:M
Added by livdywan over 2 years ago.
Updated over 2 years ago.
Description
Motivation
For the third time this week failed-systemd-services has been alerting in the middle of the European night and recovering:
2022-08-11 02:59:52 openqa logrotate-openqa 1
2022-08-10 17:43:01 QA-Power8-4-kvm openqa-worker-cacheservice-minion 1
Previously only logrotate-openqa was failing.
sudo journalctl -fu logrotate-openqa
Aug 11 09:00:02 openqa logrotate[26982]: Last rotated at 2022-08-11 04:00
Aug 11 09:00:02 openqa logrotate[26982]: log does not need rotating (log size is below the 'size' threshold)
Aug 11 09:00:02 openqa logrotate[26982]: rotating pattern: /var/log/apache2/openqa.access_log 307200000 bytes (20 rotations)
Aug 11 09:00:02 openqa logrotate[26982]: empty log files are not rotated, old logs are removed
Aug 11 09:00:02 openqa logrotate[26982]: considering log /var/log/apache2/openqa.access_log
Aug 11 09:00:02 openqa logrotate[26982]: Now: 2022-08-11 09:00
Aug 11 09:00:02 openqa logrotate[26982]: Last rotated at 2022-08-11 05:00
Aug 11 09:00:02 openqa logrotate[26982]: log does not need rotating (log size is below the 'size' threshold)
Aug 11 09:00:02 openqa systemd[1]: logrotate-openqa.service: Deactivated successfully.
Aug 11 09:00:02 openqa systemd[1]: Finished Rotate openQA log files.
Acceptance criteria
- AC1: logrotate-openqa is known to work reliably
- AC2: failed-systemd-services is not alerting on a regular basis
Suggestions
- Look into what's causing logrotate-openqa to fail
- Check if there's enough disk space / if the logs are deleted before rotation
- Related to action #114565: recover qa-power8-4+qa-power8-5 size:M added
- Target version set to Ready
- Subject changed from failed-systemd-services: logrotate-openqa alerting on and off to failed-systemd-services: logrotate-openqa alerting on and off size:M
- Description updated (diff)
- Status changed from New to Workable
Haven't seen the problem for 6 days.
Didn't we increase the journal log size at some point so we have more data?
Currently our config looks like this:
[Journal]
SystemMaxUse=80G
SystemKeepFree=20%
SystemMaxFileSize=1G
SystemMaxFiles=200
So no particular retention period is guaranteed, only that our file system doesn't get too full.
- Status changed from Workable to In Progress
- Assignee set to livdywan
/etc/systemd/journald.conf.d/journal_size.conf
We don't use /var/log/journal/ and it only contains old files from May
/run/log/journal contains 438MB of files right now
As we (Tina and I) understand the documentation, if /var/log/journal is not writable at boot time, /var/run/log/journal is used instead. systemd-journal-flush then moves the logs if possible. The service is active, but of course we have no logs to confirm whether it ran or failed.
A manual journalctl --flush was successful, and we were able to confirm that /var/log/journal/ is being used now. As a further step we're planning to test whether the same happens after a reboot. Maybe there is a race where the mount point is not yet available when the flush runs.
And maybe we should specify Storage=persistent to ensure persistent storage?
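For reference, a minimal sketch of what that setting could look like as a journald drop-in. The file name storage.conf is an assumption; the directory is the one that already holds journal_size.conf. As far as I understand the journald.conf documentation, the default Storage=auto already behaves like persistent when /var/log/journal exists, and journald still writes to /run/log/journal during early boot until the flush happens, so this alone would probably not help with a mount-ordering race.

```ini
# /etc/systemd/journald.conf.d/storage.conf  (hypothetical file name)
[Journal]
# Prefer /var/log/journal and create it if it is missing.
# journald still falls back to /run/log/journal during early boot
# and whenever the disk is not writable.
Storage=persistent
```

The setting would be picked up after restarting systemd-journald or after the next reboot.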
I just checked systemd-journal-flush.service, and it has:
RequiresMountsFor=/var/log/journal
so at least the partition should have been mounted before. The permissions also looked ok.
We just rebooted osd and saw no error message from systemd-journal-flush.service, but /srv was only mounted later:
Starting Flush Journal to Persistent Storage...
Finished Flush Journal to Persistent Storage.
...
Mounted /srv
We changed the line in /usr/lib/systemd/system/systemd-journal-flush.service to use the actual name of the mount point:
RequiresMountsFor=/srv
because /var/log -> /srv/log/ and RequiresMountsFor apparently needs the real name of the mount point.
Another reboot was successful, and now log files appear under /var/log/journal again.
Maybe this file should be in salt, as well as /etc/systemd/journald.conf.d/journal_size.conf
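If this ends up in salt, one option might be a drop-in under /etc instead of the edited unit file, since files below /usr/lib/systemd/system belong to the package and are replaced on updates. A minimal sketch, assuming the drop-in name override.conf (hypothetical) and that /srv is the real mount point as found above:

```ini
# /etc/systemd/system/systemd-journal-flush.service.d/override.conf  (hypothetical file name)
[Unit]
# /var/log is a symlink to /srv/log/, so name the real mount point.
# RequiresMountsFor= may be given more than once; this entry is added
# to the /var/log/journal entry from the packaged unit, it does not replace it.
RequiresMountsFor=/srv
```

After a systemctl daemon-reload, systemctl cat systemd-journal-flush.service should list both the packaged unit and the drop-in.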
- Status changed from In Progress to Feedback
- Status changed from Feedback to Resolved
A sudo ls -lH /var/{run/,}log/journal/* confirms that the logs still appear to be in the right place and that we are actually keeping logs as we should again, so I would consider this resolved.
- Related to action #116722: openqa.suse.de is not reachable 2022-09-18, no ping response, postgreSQL OOM and kernel panics size:M added