action #124412
closed[alert] logrotate services failed on openqa-piworker.qa.suse.de and OSD size:M
0%
Description
Observation¶
See https://stats.openqa-monitor.qa.suse.de/d/KToPYLEWz/failed-systemd-services?orgId=1&from=1676150994873&to=1676257956411 for the time frame. (It also shows some mount units but they were only failing very shortly.)
This is the log of logratate-openqa on OSD:
Feb 13 00:00:03 openqa logrotate[30440]: Now: 2023-02-13 00:00
Feb 13 00:00:03 openqa logrotate[30440]: Last rotated at 2023-02-12 05:00
Feb 13 00:00:03 openqa logrotate[30440]: log does not need rotating (log size is below the 'size' threshold)
Feb 13 00:00:03 openqa systemd[1]: logrotate-openqa.service: Deactivated successfully.
Feb 13 00:00:03 openqa systemd[1]: Finished Rotate openQA log files.
Feb 13 01:00:00 openqa systemd[1]: Starting Rotate openQA log files...
Feb 13 01:00:01 openqa logrotate[11051]: reading config file /etc/logrotate.d/openqa
Feb 13 01:00:01 openqa logrotate[11051]: warning: 'size' overrides previously specified 'hourly'
Feb 13 01:00:01 openqa logrotate[11051]: compress_prog is now /usr/bin/xz
Feb 13 01:00:01 openqa logrotate[11051]: compress_ext was changed to .xz
Feb 13 01:00:01 openqa logrotate[11051]: uncompress_prog is now /usr/bin/xzdec
Feb 13 01:00:01 openqa logrotate[11051]: warning: 'size' overrides previously specified 'hourly'
Feb 13 01:00:01 openqa logrotate[11051]: compress_prog is now /usr/bin/xz
Feb 13 01:00:01 openqa logrotate[11051]: compress_ext was changed to .xz
Feb 13 01:00:01 openqa logrotate[11051]: uncompress_prog is now /usr/bin/xzdec
Feb 13 01:00:01 openqa logrotate[11051]: reading config file /etc/logrotate.d/openqa-apache
Feb 13 01:00:01 openqa logrotate[11051]: warning: 'size' overrides previously specified 'hourly'
Feb 13 01:00:01 openqa logrotate[11051]: compress_prog is now /usr/bin/xz
Feb 13 01:00:01 openqa logrotate[11051]: compress_ext was changed to .xz
Feb 13 01:00:01 openqa logrotate[11051]: uncompress_prog is now /usr/bin/xzdec
Feb 13 01:00:01 openqa logrotate[11051]: error: state file /var/lib/misc/logrotate.status is already locked
Feb 13 01:00:01 openqa logrotate[11051]: logrotate does not support parallel execution on the same set of logfiles.
Feb 13 01:00:01 openqa systemd[1]: logrotate-openqa.service: Main process exited, code=exited, status=3/NOTIMPLEMENTED
Feb 13 01:00:01 openqa systemd[1]: logrotate-openqa.service: Failed with result 'exit-code'.
Feb 13 01:00:01 openqa systemd[1]: Failed to start Rotate openQA log files.
Feb 13 02:00:01 openqa systemd[1]: Starting Rotate openQA log files...
Feb 13 02:00:01 openqa logrotate[12997]: reading config file /etc/logrotate.d/openqa
The log for logrotate on the Pi is unfortunately empty. So there's likely not much we can do about the Pi at this point.
I haven't paused the alerts. Let's see whether this is happening again.
Acceptance criteria¶
- AC1: The pi worker no longer triggers the failed systemd services alert
- AC2: OSD no longer triggers the failed systemd services alert
Note that the issue on OSD was a one-time issue and is at this point no concern anymore. It happened again on OSD as well:
Feb 26 00:00:03 openqa systemd[1]: Finished Rotate openQA log files.
Feb 26 01:00:00 openqa systemd[1]: Starting Rotate openQA log files...
Feb 26 01:00:00 openqa logrotate[16932]: reading config file /etc/logrotate.d/openqa
Feb 26 01:00:00 openqa logrotate[16932]: warning: 'size' overrides previously specified 'hourly'
Feb 26 01:00:00 openqa logrotate[16932]: compress_prog is now /usr/bin/xz
Feb 26 01:00:00 openqa logrotate[16932]: compress_ext was changed to .xz
Feb 26 01:00:00 openqa logrotate[16932]: uncompress_prog is now /usr/bin/xzdec
Feb 26 01:00:00 openqa logrotate[16932]: warning: 'size' overrides previously specified 'hourly'
Feb 26 01:00:00 openqa logrotate[16932]: compress_prog is now /usr/bin/xz
Feb 26 01:00:00 openqa logrotate[16932]: compress_ext was changed to .xz
Feb 26 01:00:00 openqa logrotate[16932]: uncompress_prog is now /usr/bin/xzdec
Feb 26 01:00:00 openqa logrotate[16932]: reading config file /etc/logrotate.d/openqa-apache
Feb 26 01:00:00 openqa logrotate[16932]: warning: 'size' overrides previously specified 'hourly'
Feb 26 01:00:00 openqa logrotate[16932]: compress_prog is now /usr/bin/xz
Feb 26 01:00:00 openqa logrotate[16932]: compress_ext was changed to .xz
Feb 26 01:00:00 openqa logrotate[16932]: uncompress_prog is now /usr/bin/xzdec
Feb 26 01:00:00 openqa logrotate[16932]: error: state file /var/lib/misc/logrotate.status is already locked
Feb 26 01:00:00 openqa logrotate[16932]: logrotate does not support parallel execution on the same set of logfiles.
Feb 26 01:00:00 openqa systemd[1]: logrotate-openqa.service: Main process exited, code=exited, status=3/NOTIMPLEMENTED
Feb 26 01:00:00 openqa systemd[1]: logrotate-openqa.service: Failed with result 'exit-code'.
Feb 26 01:00:00 openqa systemd[1]: Failed to start Rotate openQA log files.
Feb 26 02:00:00 openqa systemd[1]: Starting Rotate openQA log files...
Suggestions¶
- Look into log files on osd for details (as above)
- Check what happens if logrotate is called manually. It's rather safe to call logrotate again but might trigger the above
- Likely an unclean shutdown caused this. Try to trigger manually.
- Ask @dheidler to fix the Pi-worker
Updated by okurz over 1 year ago
- Description updated (diff)
- Status changed from New to Resolved
- Assignee set to okurz
As discussed in unblock: Likely an unclean shutdown caused this and we don't need to handle that now. We will see if we have the issue popping up again.
Updated by jbaier_cz over 1 year ago
- Status changed from Resolved to New
- Assignee deleted (
okurz)
It is happening again on openqa-piworker.qa.suse.de.
Updated by mkittler over 1 year ago
- Subject changed from [alert] logrotate services failed on openqa-piworker.qa.suse.de and OSD to [alert] logrotate services failed on openqa-piworker.qa.suse.de and OSD size:M
- Description updated (diff)
Updated by mkittler over 1 year ago
It can't be that logrotate.service
runs into a timeout because the service it oneshot
and the docs say:
Defaults to DefaultTimeoutStartSec= set in the manager, except when Type=oneshot is used, in which case the timeout is disabled by default (see systemd-system.conf(5)).
Since we don't have a persistent journal on the pi-worker it is hard to tell why the service failed.
@dheidler So this actually is about logrotate.service
. I confused it with systemd-journal-flush.service
/systemd-journald.service
last time we discussed this. This also means we likely shouldn't just ignore it in our alerting.
Updated by mkittler over 1 year ago
Annoyingly, this problem is actually still happening - on the piworker and OSD: https://stats.openqa-monitor.qa.suse.de/d/KToPYLEWz/failed-systemd-services?orgId=1&from=1677358457916&to=1677476057916
Updated by mkittler over 1 year ago
- Description updated (diff)
I'm wondering about the difference between logrotate-openqa.service
(that one fails on OSD) and logrotate.service
(that one fails on the pi-worker). On OSD we have actually both services running:
martchus@openqa:~> sudo systemctl status logrotate.timer logrotate-openqa.timer
● logrotate.timer - Daily rotation of log files
Loaded: loaded (/usr/lib/systemd/system/logrotate.timer; enabled; vendor preset: enabled)
Active: active (waiting) since Sun 2023-02-26 03:32:33 CET; 1 day 10h ago
Trigger: Tue 2023-02-28 00:00:00 CET; 9h left
Triggers: ● logrotate.service
Docs: man:logrotate(8)
man:logrotate.conf(5)
Feb 26 03:32:33 openqa systemd[1]: Started Daily rotation of log files.
● logrotate-openqa.timer - Hourly rotation of openqa log files
Loaded: loaded (/etc/systemd/system/logrotate-openqa.timer; enabled; vendor preset: disabled)
Active: active (waiting) since Sun 2023-02-26 03:32:33 CET; 1 day 10h ago
Trigger: Mon 2023-02-27 15:00:00 CET; 43min left
Triggers: ● logrotate-openqa.service
Docs: man:logrotate(8)
man:logrotate.conf(5)
Feb 26 03:32:33 openqa systemd[1]: Started Hourly rotation of openqa log files.
Both services are triggered via timers. Is that wanted? Maybe it is the reason why logrotate complains about the parallel execution.
Updated by mkittler over 1 year ago
This PR might help with the issue we see on OSD: https://gitlab.suse.de/openqa/salt-states-openqa/-/merge_requests/799
It looks like our salt states only mess with logrotate in the openqa.server
role. So the logrotate setup on the pi-worker is most likely the default setup of Leap 15.4. That means the problem on the pi-worker must be a different one as there are no conflicting services. Without persistent logs it is still impossible to pin-down, though.
If we wanted to ignore the logrotate service on the pi-worker we could use the --exclude
option of salt-states-openqa/monitoring/telegraf/scripts/systemd_list_service_by_state_for_telegraf.sh
added by a3050d28b5143a8d9052c04dbb946c3161f9bf1d
. We'd only needed to add a way to specify additional parameters for this script via pillars.
Updated by okurz over 1 year ago
Regarding the journal on the piworker we have persistent log files for various services but not the system journal. We suggest to make the system journal persistent, at least for a limited time and monitor for subsequent failures.
Updated by mkittler over 1 year ago
- Status changed from In Progress to Feedback
The MR for OSD has been deployed correctly. Let's see whether it worked.
I enabled a persistent journal on pi-worker to be able to further investigate the issue there.
Updated by okurz over 1 year ago
- Has duplicate action #125126: [qe-core] Ensure that each SLE release request assigned with "qam-openqa" has at least a "fake" openQA test added
Updated by okurz over 1 year ago
- Has duplicate deleted (action #125126: [qe-core] Ensure that each SLE release request assigned with "qam-openqa" has at least a "fake" openQA test)
Updated by okurz over 1 year ago
- Related to action #125132: [alert] logrotate failed on OSD added
Updated by mkittler over 1 year ago
- Status changed from Feedback to Resolved
I haven't seen the issue again anymore and our alerting would tell us. So I'm resolving this. When it happens next time we will have the persistent log for further investigation.
Updated by mkittler over 1 year ago
I've seen it on openqa-piworker.qa.suse.de again but despite persistent journal there's no trace of the problem because the journal doesn't go far enough. So I guess there's still not much more we can do about it right now.