action #115208: failed-systemd-services: logrotate-openqa alerting on and off size:M - openQA Infrastructure - openSUSE Project Management Tool

closed

Added by livdywan about 2 years ago. Updated about 2 years ago.

Status: Resolved

Priority: High

Assignee: livdywan

Target version: openQA Project - Ready

Description

Motivation

For the third time this week failed-systemd-services has been alerting in the middle of the European night and recovering:

2022-08-11 02:59:52  openqa           logrotate-openqa                   1
2022-08-10 17:43:01  QA-Power8-4-kvm  openqa-worker-cacheservice-minion  1

Previously only logrotate-openqa was failing.

sudo journalctl -fu logrotate-openqa
Aug 11 09:00:02 openqa logrotate[26982]:   Last rotated at 2022-08-11 04:00
Aug 11 09:00:02 openqa logrotate[26982]:   log does not need rotating (log size is below the 'size' threshold)
Aug 11 09:00:02 openqa logrotate[26982]: rotating pattern: /var/log/apache2/openqa.access_log  307200000 bytes (20 rotations)
Aug 11 09:00:02 openqa logrotate[26982]: empty log files are not rotated, old logs are removed
Aug 11 09:00:02 openqa logrotate[26982]: considering log /var/log/apache2/openqa.access_log
Aug 11 09:00:02 openqa logrotate[26982]:   Now: 2022-08-11 09:00
Aug 11 09:00:02 openqa logrotate[26982]:   Last rotated at 2022-08-11 05:00
Aug 11 09:00:02 openqa logrotate[26982]:   log does not need rotating (log size is below the 'size' threshold)
Aug 11 09:00:02 openqa systemd[1]: logrotate-openqa.service: Deactivated successfully.
Aug 11 09:00:02 openqa systemd[1]: Finished Rotate openQA log files.

Acceptance criteria

AC1: logrotate-openqa is known to work reliably
AC2: failed-systemd-services is not alerting on a regular basis

Suggestions

Look into what's causing logrotate-openqa to fail
Check if there's enough disk space / if the logs are deleted before rotation
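
The two suggestions above can be sketched as a pair of shell helpers. This is only a minimal sketch: the function names disk_use_pct and show_unit_failures are made up for illustration, and the time passed in the commented-out call is an arbitrary point shortly before the alert.

```shell
# Print the used-space percentage (without the % sign) of the file system
# that holds the given path. Uses the POSIX output format of df.
disk_use_pct() {
    df -P "$1" | awk 'NR == 2 { sub(/%/, "", $5); print $5 }'
}

# Show failed units, then the error-level journal entries of one unit since a
# given time. Needs systemd, so it is defined here but not called.
show_unit_failures() {
    systemctl --failed --no-pager
    journalctl --no-pager -p err -u "$1" --since "$2"
}

echo "/var/log is $(disk_use_pct /var/log)% full"
# show_unit_failures logrotate-openqa "2022-08-11 02:30"
```

The journal query only helps if the journal still holds entries from the time of the failure.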

Related issues 2 (0 open — 2 closed)

Updated by livdywan about 2 years ago

Related to action #114565: recover qa-power8-4+qa-power8-5 size:M added

Updated by tinita about 2 years ago

Looking at Grafana (https://stats.openqa-monitor.qa.suse.de/d/KToPYLEWz/failed-systemd-services?orgId=1), the logrotate failure happened at 2022-08-11 02:59:52, but the journal only goes back to 06:01:42 this morning. So we don't really know what exactly failed, I guess.

Didn't we increase the journal log size at some point so we have more data?
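
The comparison made above (oldest journal entry versus time of the failure) can be written down as a small check. A minimal sketch, assuming both timestamps are in the "YYYY-MM-DD HH:MM:SS" form used in this ticket; the function name journal_covers is hypothetical.

```shell
# Succeed if the oldest journal entry is at or before the event time.
# Timestamps in "YYYY-MM-DD HH:MM:SS" form sort chronologically as plain text,
# so the earlier of the two is simply the first line after sorting.
journal_covers() {
    oldest=$1
    event=$2
    earliest=$(printf '%s\n%s\n' "$oldest" "$event" | LC_ALL=C sort | head -n 1)
    [ "$earliest" = "$oldest" ]
}

# Timestamps taken from this ticket: journal start and logrotate failure.
if journal_covers "2022-08-11 06:01:42" "2022-08-11 02:59:52"; then
    echo "journal covers the failure"
else
    echo "journal starts after the failure"
fi
```

On a live host the first argument could come from the first line of `journalctl -q --no-pager -o short-iso`, whose timestamp format differs slightly and would need trimming first.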

Updated by okurz about 2 years ago

Target version set to Ready

Updated by livdywan about 2 years ago

Subject changed from failed-systemd-services: logrotate-openqa alerting on and off to failed-systemd-services: logrotate-openqa alerting on and off size:M
Description updated (diff)
Status changed from New to Workable

Updated by mkittler about 2 years ago

Haven't seen the problem for 6 days.

> Didn't we increase the journal log size at some point so we have more data?

Currently our config looks like this:

[Journal]
SystemMaxUse=80G
SystemKeepFree=20%
SystemMaxFileSize=1G
SystemMaxFiles=200

So a certain storage duration isn't guaranteed; the settings only make sure that our file system isn't getting too full.

Updated by livdywan about 2 years ago

Status changed from Workable to In Progress
Assignee set to livdywan

Config file: /etc/systemd/journald.conf.d/journal_size.conf
/var/log/journal/ is not in use and only contains old files from May
/run/log/journal contains 438MB of files right now
As we (Tina and I) understand the documentation, if /var/log/journal is not writable at boot time, /run/log/journal is used instead. systemd-journal-flush then moves the logs if possible. The service is active but, of course, we have no logs to confirm whether it ran or failed.
A manual journalctl --flush was successful, and we can confirm that /var/log/journal/ is being used now. As a next step we want to test whether the same happens after a reboot; maybe there is a race with the mount point not being available at all
Maybe we should also specify Storage=persistent to ensure persistent storage?
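
Since /var/log/journal/ held only old files, the mere presence of files there does not show where journald is writing. A minimal sketch of a recency check follows; the function name journal_dir_state and the 60-minute default are assumptions for illustration, and reading /var/log/journal may need root.

```shell
# Print "active" if any *.journal file under directory $1 was modified within
# the last $2 minutes (default 60), otherwise "stale". A missing or unreadable
# directory also counts as "stale".
journal_dir_state() {
    dir=$1
    minutes=${2:-60}
    recent=$(find "$dir" -type f -name '*.journal' -mmin "-$minutes" 2>/dev/null | head -n 1)
    if [ -n "$recent" ]; then
        echo active
    else
        echo stale
    fi
}

for d in /var/log/journal /run/log/journal; do
    echo "$d: $(journal_dir_state "$d")"
done
```

With the state described above, one would expect /run/log/journal to be reported as active and /var/log/journal as stale until the flush succeeds.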

Updated by tinita about 2 years ago

I just checked systemd-journal-flush.service, and it has:

RequiresMountsFor=/var/log/journal

so at least the partition should have been mounted before. The permissions also looked ok.

Updated by tinita about 2 years ago

We just rebooted osd and saw no error message from systemd-journal-flush.service, but /srv was only mounted afterwards:

Starting Flush Journal to Persistent Storage...
Finished Flush Journal to Persistent Storage.
...
Mounted /srv

We changed the line in /usr/lib/systemd/system/systemd-journal-flush.service to use the actual name of the mount point:

RequiresMountsFor=/srv

because /var/log points to /srv/log/ (/var/log -> /srv/log/), and RequiresMountsFor apparently needs the real name of the mount point.
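
To find the real mount point behind a path like /var/log/journal, the symlink can be resolved first and the mount looked up afterwards. A minimal sketch, assuming GNU readlink and findmnt from util-linux are available; the function names real_path and mount_point_of are hypothetical.

```shell
# Print the symlink-free form of a path. On a host where /var/log points to
# /srv/log, /var/log/journal would resolve to /srv/log/journal.
real_path() {
    readlink -f "$1"
}

# Print the mount point of the file system that holds a path.
# Defined here but not called, since the result depends on the host.
mount_point_of() {
    findmnt -n -o TARGET --target "$(real_path "$1")"
}

# Example call on a real host:
# mount_point_of /var/log/journal
```

On osd the result would be expected to be /srv, matching the value used in the changed RequiresMountsFor line.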

Another reboot was successful, and now logfiles appear under /var/log/journal again.

Maybe this file should be in salt, as well as /etc/systemd/journald.conf.d/journal_size.conf

Updated by livdywan about 2 years ago

Status changed from In Progress to Feedback

~~I suspect the timeout was meant to serve as a work-around so I'm replacing it~~ I'm adding an override to require the mountpoint: https://gitlab.suse.de/openqa/salt-states-openqa/-/merge_requests/727/diffs
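
For readers without access to the merge request, an override of this kind could look roughly like the sketch below. The content of the actual merge request is not shown in this ticket, so the file name override.conf, the function name write_flush_override and the exact drop-in text are assumptions. A RequiresMountsFor= line in a drop-in adds to the paths already listed in the unit, so the packaged file under /usr/lib stays untouched.

```shell
# Write a drop-in that makes a unit wait for a mount point.
# $1: drop-in directory, $2: mount point (both chosen by the caller).
write_flush_override() {
    dropin_dir=$1
    mount_point=$2
    mkdir -p "$dropin_dir"
    printf '[Unit]\nRequiresMountsFor=%s\n' "$mount_point" > "$dropin_dir/override.conf"
}

# Demonstration in a scratch directory. On a real host the directory would be
# /etc/systemd/system/systemd-journal-flush.service.d, followed by
# "systemctl daemon-reload".
demo_dir=$(mktemp -d)
write_flush_override "$demo_dir" /srv
cat "$demo_dir/override.conf"
```

Afterwards, `systemctl show -p RequiresMountsFor systemd-journal-flush.service` can be used to check what the unit ended up with.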

Updated by livdywan about 2 years ago

Status changed from Feedback to Resolved

Running sudo ls -lH /var/{run/,}log/journal/* confirms that the logs are still in the right place and that we keep logs like we should again, so I consider this resolved.

Updated by okurz about 2 years ago

Related to action #116722: openqa.suse.de is not reachable 2022-09-18, no ping response, postgreSQL OOM and kernel panics size:M added
