action #115208
failed-systemd-services: logrotate-openqa alerting on and off size:M
Status: closed
Description
Motivation
For the third time this week failed-systemd-services has been alerting in the middle of the European night and recovering:
2022-08-11 02:59:52 openqa logrotate-openqa 1
2022-08-10 17:43:01 QA-Power8-4-kvm openqa-worker-cacheservice-minion 1
Previously only logrotate-openqa was failing.
sudo journalctl -fu logrotate-openqa
Aug 11 09:00:02 openqa logrotate[26982]: Last rotated at 2022-08-11 04:00
Aug 11 09:00:02 openqa logrotate[26982]: log does not need rotating (log size is below the 'size' threshold)
Aug 11 09:00:02 openqa logrotate[26982]: rotating pattern: /var/log/apache2/openqa.access_log 307200000 bytes (20 rotations)
Aug 11 09:00:02 openqa logrotate[26982]: empty log files are not rotated, old logs are removed
Aug 11 09:00:02 openqa logrotate[26982]: considering log /var/log/apache2/openqa.access_log
Aug 11 09:00:02 openqa logrotate[26982]: Now: 2022-08-11 09:00
Aug 11 09:00:02 openqa logrotate[26982]: Last rotated at 2022-08-11 05:00
Aug 11 09:00:02 openqa logrotate[26982]: log does not need rotating (log size is below the 'size' threshold)
Aug 11 09:00:02 openqa systemd[1]: logrotate-openqa.service: Deactivated successfully.
Aug 11 09:00:02 openqa systemd[1]: Finished Rotate openQA log files.
Acceptance criteria
- AC1: logrotate-openqa is known to work reliably
- AC2: failed-systemd-services is not alerting on a regular basis
Suggestions
- Look into what's causing logrotate-openqa to fail
- Check if there's enough disk space / if the logs are deleted before rotation
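The two suggestions above could be checked roughly as follows. This is only a sketch: the helper name `disk_use_percent`, the `LOG_DIR` variable and the 90% threshold are made up for illustration, and the `systemctl`/`journalctl` calls are left as comments because they only make sense on the affected host.

```shell
#!/bin/sh
# Print the used percentage (without the % sign) of the file system holding $1.
# Hypothetical helper, not part of openQA or logrotate.
disk_use_percent() {
    df -P "$1" | awk 'NR==2 { sub(/%/, "", $5); print $5 }'
}

# Warn if the file system behind the log directory is more than 90% full.
# Both the default path and the threshold are assumptions.
log_dir=${LOG_DIR:-/var/log}
if [ -d "$log_dir" ] && [ "$(disk_use_percent "$log_dir")" -gt 90 ]; then
    echo "warning: $log_dir is almost full"
fi

# On the host itself, the unit state and the journal around the alert time
# can be inspected with:
#   systemctl status logrotate-openqa
#   sudo journalctl -u logrotate-openqa --since "2022-08-11 02:30" --until "2022-08-11 03:30"
```

The time window in the last command is simply placed around the alert time quoted in the description; it would need adjusting if the monitoring and the host use different time zones.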
Updated by livdywan about 2 years ago
- Related to action #114565: recover qa-power8-4+qa-power8-5 size:M added
Updated by tinita about 2 years ago
Looking at Grafana (https://stats.openqa-monitor.qa.suse.de/d/KToPYLEWz/failed-systemd-services?orgId=1), the logrotate failure was at 2022-08-11 02:59:52, but the journal only goes back to 06:01:42 this morning. So I guess we don't really know what exactly failed.
Didn't we increase the journal log size at some point so we have more data?
Updated by livdywan about 2 years ago
- Subject changed from failed-systemd-services: logrotate-openqa alerting on and off to failed-systemd-services: logrotate-openqa alerting on and off size:M
- Description updated (diff)
- Status changed from New to Workable
Updated by mkittler about 2 years ago
Haven't seen the problem for 6 days.
> Didn't we increase the journal log size at some point so we have more data?
Currently our config looks like this:
[Journal]
SystemMaxUse=80G
SystemKeepFree=20%
SystemMaxFileSize=1G
SystemMaxFiles=200
So a certain storage duration isn't guaranteed; the config only ensures that our file system isn't getting too full.
Updated by livdywan about 2 years ago
- Status changed from Workable to In Progress
- Assignee set to livdywan
/etc/systemd/journald.conf.d/journal_size.conf
We don't use /var/log/journal/ and it only contains old files from May
/run/log/journal contains 438MB of files right now
As we (Tina and I) understand the documentation, if /var/log/journal is not writable at boot time, /var/run/log/journal will be used. systemd-journal-flush then moves the logs if possible. The service is active, but of course we have no logs to confirm whether it ran or failed to do so.
A manual journalctl --flush was successful, and we're able to confirm that /var/log/journal/ is being used now. As a further step we're thinking of testing whether the same happens after a reboot. Maybe there is a race with the mount point not being available at all.
And maybe we should specify Storage=persistent to ensure persistent storage?
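For reference, a minimal sketch of what such a setting could look like as a journald drop-in. The function name and the file name `storage.conf` are assumptions for illustration; only the drop-in directory is taken from the path mentioned above.

```shell
#!/bin/sh
# Write a journald drop-in that requests persistent storage.
# $1 is the drop-in directory; on a real host that would be
# /etc/systemd/journald.conf.d. The file name is an assumption.
write_storage_dropin() {
    mkdir -p "$1" &&
    printf '[Journal]\nStorage=persistent\n' > "$1/storage.conf"
}

# On the host (as root), followed by a restart of journald:
#   write_storage_dropin /etc/systemd/journald.conf.d
#   systemctl restart systemd-journald
#   journalctl --disk-usage
```

As far as journald.conf(5) describes it, the default Storage=auto already writes to /var/log/journal when that directory exists, and Storage=persistent still falls back to /run/log/journal during early boot or when the disk is not writable. So this setting alone may not help if the real problem is a race with the mount point.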
Updated by tinita about 2 years ago
I just checked systemd-journal-flush.service, and it has:
RequiresMountsFor=/var/log/journal
so at least the partition should have been mounted before. The permissions also looked ok.
Updated by tinita about 2 years ago
We just rebooted osd and saw that there was no error message from systemd-journal-flush.service, but /srv was only mounted later:
Starting Flush Journal to Persistent Storage...
Finished Flush Journal to Persistent Storage.
...
Mounted /srv
We changed the line in /usr/lib/systemd/system/systemd-journal-flush.service to use the actual name of the mount point:
RequiresMountsFor=/srv
because /var/log is a symlink to /srv/log/ and RequiresMountsFor apparently needs the real name of the mount point.
Another reboot was successful, and now log files appear under /var/log/journal again.
Maybe this file should be in salt, as well as /etc/systemd/journald.conf.d/journal_size.conf
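To find the path that would have to go into RequiresMountsFor, the symlink can be resolved first. A small sketch, with hypothetical helper names; it relies on `readlink -f` and `df -P`, and assumes mount points without spaces:

```shell
#!/bin/sh
# Resolve all symlinks in a path. On osd, /var/log points to /srv/log/,
# so /var/log/journal should resolve to a path below /srv/log/.
resolve_real_path() {
    readlink -f "$1"
}

# Print the mount point of the file system that holds the resolved path.
mount_point_of() {
    df -P "$(resolve_real_path "$1")" | awk 'NR==2 { print $6 }'
}

# On the host:
#   resolve_real_path /var/log/journal
#   mount_point_of /var/log/journal
#   systemctl show -p RequiresMountsFor systemd-journal-flush.service
```

The last command shows what the unit currently depends on, which makes it easy to compare against the real mount point.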
Updated by livdywan about 2 years ago
- Status changed from In Progress to Feedback
I suspect the timeout was meant to serve as a work-around, so I'm replacing it with an override that requires the mountpoint: https://gitlab.suse.de/openqa/salt-states-openqa/-/merge_requests/727/diffs
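Not the content of the merge request, just a sketch of how such an override could look as a systemd drop-in. The function name, the drop-in directory and the file name `override.conf` are assumptions; a drop-in under /etc has the advantage that it is not overwritten when the package updates the unit file in /usr/lib.

```shell
#!/bin/sh
# Write a drop-in for a unit that makes it wait for a mount point.
# $1: drop-in directory, $2: real mount point (not a symlinked path).
write_flush_override() {
    mkdir -p "$1" &&
    printf '[Unit]\nRequiresMountsFor=%s\n' "$2" > "$1/override.conf"
}

# On the host (as root), assuming the usual drop-in location:
#   write_flush_override /etc/systemd/system/systemd-journal-flush.service.d /srv
#   systemctl daemon-reload
#   systemctl show -p RequiresMountsFor systemd-journal-flush.service
```

As far as I know RequiresMountsFor accumulates across files, so the drop-in would add /srv on top of the /var/log/journal entry shipped with the unit rather than replace it.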
Updated by livdywan about 2 years ago
- Status changed from Feedback to Resolved
A sudo ls -lH /var/{run/,}log/journal/* confirms that the logs still look to be in the right place and that we actually keep logs like we should again, so I would consider this to be resolved.
Updated by okurz about 2 years ago
- Related to action #116722: openqa.suse.de is not reachable 2022-09-18, no ping response, postgreSQL OOM and kernel panics size:M added