action #115208

closed

failed-systemd-services: logrotate-openqa alerting on and off size:M

Added by livdywan over 2 years ago. Updated over 2 years ago.

Status:
Resolved
Priority:
High
Assignee:
Category:
-
Start date:
Due date:
% Done:

0%

Estimated time:

Description

Motivation

For the third time this week, failed-systemd-services has been alerting in the middle of the European night and then recovering:

2022-08-11 02:59:52  openqa           logrotate-openqa                   1
2022-08-10 17:43:01  QA-Power8-4-kvm  openqa-worker-cacheservice-minion  1

Previously only logrotate-openqa was failing.

sudo journalctl -fu logrotate-openqa
Aug 11 09:00:02 openqa logrotate[26982]:   Last rotated at 2022-08-11 04:00
Aug 11 09:00:02 openqa logrotate[26982]:   log does not need rotating (log size is below the 'size' threshold)
Aug 11 09:00:02 openqa logrotate[26982]: rotating pattern: /var/log/apache2/openqa.access_log  307200000 bytes (20 rotations)
Aug 11 09:00:02 openqa logrotate[26982]: empty log files are not rotated, old logs are removed
Aug 11 09:00:02 openqa logrotate[26982]: considering log /var/log/apache2/openqa.access_log
Aug 11 09:00:02 openqa logrotate[26982]:   Now: 2022-08-11 09:00
Aug 11 09:00:02 openqa logrotate[26982]:   Last rotated at 2022-08-11 05:00
Aug 11 09:00:02 openqa logrotate[26982]:   log does not need rotating (log size is below the 'size' threshold)
Aug 11 09:00:02 openqa systemd[1]: logrotate-openqa.service: Deactivated successfully.
Aug 11 09:00:02 openqa systemd[1]: Finished Rotate openQA log files.

Acceptance criteria

  • AC1: logrotate-openqa is known to work reliably
  • AC2: failed-systemd-services is not alerting on a regular basis

Suggestions

  • Look into what's causing logrotate-openqa to fail
  • Check if there's enough disk space / if the logs are deleted before rotation
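The disk-space suggestion could be scripted as a quick pre-rotation check; a minimal sketch (the path and the 1 GiB threshold are illustrative assumptions, not values from this ticket):

```shell
#!/bin/sh
# Warn if the filesystem holding the logs is low on space before rotation.
# log_dir and min_kb are illustrative, not taken from the OSD setup.
log_dir=/var/log
min_kb=$((1024 * 1024))  # 1 GiB expressed in KiB
# df -Pk prints POSIX-format output in KiB; field 4 of row 2 is "Available".
avail_kb=$(df -Pk "$log_dir" | awk 'NR==2 {print $4}')
if [ "$avail_kb" -lt "$min_kb" ]; then
    echo "WARNING: only ${avail_kb} KiB free on ${log_dir}"
else
    echo "OK: ${avail_kb} KiB free on ${log_dir}"
fi
```

Such a check could run right before the rotation timer fires, so a full filesystem shows up as an explicit warning instead of a nightly logrotate failure.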

Related issues 2 (0 open, 2 closed)

Related to openQA Infrastructure (public) - action #114565: recover qa-power8-4+qa-power8-5 size:M (Resolved, okurz, 2022-12-19)

Related to openQA Infrastructure (public) - action #116722: openqa.suse.de is not reachable 2022-09-18, no ping response, postgreSQL OOM and kernel panics size:M (Resolved, mkittler, 2022-09-18)
Actions #1

Updated by livdywan over 2 years ago

  • Related to action #114565: recover qa-power8-4+qa-power8-5 size:M added
Actions #2

Updated by tinita over 2 years ago

If I look at grafana (https://stats.openqa-monitor.qa.suse.de/d/KToPYLEWz/failed-systemd-services?orgId=1), the logrotate failure was at 2022-08-11 02:59:52, but the journal only goes back to 06:01:42 this morning. So we don't really know what exactly failed, I guess.

Didn't we increase the journal log size at some point so we have more data?

Actions #3

Updated by okurz over 2 years ago

  • Target version set to Ready
Actions #4

Updated by livdywan over 2 years ago

  • Subject changed from failed-systemd-services: logrotate-openqa alerting on and off to failed-systemd-services: logrotate-openqa alerting on and off size:M
  • Description updated (diff)
  • Status changed from New to Workable
Actions #5

Updated by mkittler over 2 years ago

Haven't seen the problem for 6 days.


Didn't we increase the journal log size at some point so we have more data?

Currently our config looks like this:

[Journal]
SystemMaxUse=80G
SystemKeepFree=20%
SystemMaxFileSize=1G
SystemMaxFiles=200

So a certain storage duration isn't guaranteed; only that our file system isn't getting too full.
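For reference, journald also has an age cap that could be added as a drop-in; note that all of these options are upper bounds, so none of them guarantees a minimum history. A sketch (file name and value are illustrative, not what is deployed):

```ini
# /etc/systemd/journald.conf.d/retention.conf  (hypothetical drop-in)
[Journal]
# Discard entries older than one month; this is an upper bound on age,
# complementary to the size caps in journal_size.conf.
MaxRetentionSec=1month
```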

Actions #6

Updated by livdywan over 2 years ago

  • Status changed from Workable to In Progress
  • Assignee set to livdywan

  • /etc/systemd/journald.conf.d/journal_size.conf
  • We don't use /var/log/journal/ and it only contains old files from May
  • /run/log/journal contains 438MB of files right now
  • As we (Tina and I) understand the documentation, if /var/log/journal is not writable at boot time, /run/log/journal is used instead. systemd-journal-flush then moves the logs over if possible. The service is active, but... of course we have no logs to confirm whether it ran or failed to do so.
  • A manual journalctl --flush was successful, and we were able to confirm that /var/log/journal/ is being used now. As a further step we're thinking of testing whether the same happens after a reboot - maybe there is a race with the mount point not being available yet.
  • And maybe we should specify Storage=persistent to ensure persistent storage?
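The Storage=persistent idea from the last point would be a small drop-in following the same convention as journal_size.conf; a sketch (the file name is illustrative):

```ini
# /etc/systemd/journald.conf.d/storage.conf  (hypothetical drop-in)
[Journal]
# Force logs to /var/log/journal, creating the directory if needed,
# instead of silently falling back to the volatile /run/log/journal.
Storage=persistent
```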

Actions #7

Updated by tinita over 2 years ago

I just checked systemd-journal-flush.service, and it has:

RequiresMountsFor=/var/log/journal

so at least the partition should have been mounted before. The permissions also looked ok.

Actions #8

Updated by tinita over 2 years ago

We just rebooted osd, and saw that there was no error message from systemd-journal-flush.service, but the mounting of /srv happened later:

Starting Flush Journal to Persistent Storage...
Finished Flush Journal to Persistent Storage.
...
Mounted /srv

We changed the line in /usr/lib/systemd/system/systemd-journal-flush.service to use the actual name of the mount point:

RequiresMountsFor=/srv

because /var/log is a symlink to /srv/log/, and RequiresMountsFor apparently needs the real name of the mount point.
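The symlink situation can be illustrated without touching systemd: the canonical (resolved) path is the one the mount-point dependency has to name. A sketch mimicking the layout in a temporary directory (all paths are illustrative):

```shell
#!/bin/sh
# Mimic the OSD layout: /var/log is a symlink into /srv/log.
tmp=$(mktemp -d)
mkdir -p "$tmp/srv/log/journal"
ln -s "$tmp/srv/log" "$tmp/var_log"
# readlink -f canonicalizes the path, yielding the real location under
# srv/ - the name that RequiresMountsFor= would need to refer to.
resolved=$(readlink -f "$tmp/var_log/journal")
echo "$resolved"
```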

Another reboot was successful, and now logfiles appear under /var/log/journal again.

Maybe this file should be in salt, as well as /etc/systemd/journald.conf.d/journal_size.conf
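If this goes into salt, a drop-in under /etc would survive package updates, unlike the direct edit to the unit shipped in /usr/lib. A sketch that writes the override into a throwaway root (DESTDIR is only for illustration; on the real host the file would go to /etc/... directly, followed by systemctl daemon-reload):

```shell
#!/bin/sh
# Install RequiresMountsFor=/srv as a drop-in override instead of
# editing /usr/lib/systemd/system/systemd-journal-flush.service.
# DESTDIR is an illustrative sandbox root; it would be "" on a real host.
DESTDIR=$(mktemp -d)
dropin_dir="$DESTDIR/etc/systemd/system/systemd-journal-flush.service.d"
mkdir -p "$dropin_dir"
cat > "$dropin_dir/override.conf" <<'EOF'
[Unit]
RequiresMountsFor=/srv
EOF
cat "$dropin_dir/override.conf"
```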

Actions #9

Updated by livdywan over 2 years ago

  • Status changed from In Progress to Feedback

I suspect the timeout was meant to serve as a work-around, so I'm replacing it with an override that requires the mount point: https://gitlab.suse.de/openqa/salt-states-openqa/-/merge_requests/727/diffs

Actions #10

Updated by livdywan over 2 years ago

  • Status changed from Feedback to Resolved

A sudo ls -lH /var/{run/,}log/journal/* confirms that the logs are still in the right place and that we actually keep logs like we should again, so I would consider this resolved.

Actions #11

Updated by okurz over 2 years ago

  • Related to action #116722: openqa.suse.de is not reachable 2022-09-18, no ping response, postgreSQL OOM and kernel panics size:M added