action #124412
[alert] logrotate services failed on openqa-piworker.qa.suse.de and OSD size:M
0%
Description
Observation¶
See https://stats.openqa-monitor.qa.suse.de/d/KToPYLEWz/failed-systemd-services?orgId=1&from=1676150994873&to=1676257956411 for the time frame. (It also shows some mount units but they were only failing very shortly.)
This is the log of logratate-openqa on OSD:
Feb 13 00:00:03 openqa logrotate[30440]: Now: 2023-02-13 00:00 Feb 13 00:00:03 openqa logrotate[30440]: Last rotated at 2023-02-12 05:00 Feb 13 00:00:03 openqa logrotate[30440]: log does not need rotating (log size is below the 'size' threshold) Feb 13 00:00:03 openqa systemd[1]: logrotate-openqa.service: Deactivated successfully. Feb 13 00:00:03 openqa systemd[1]: Finished Rotate openQA log files. Feb 13 01:00:00 openqa systemd[1]: Starting Rotate openQA log files... Feb 13 01:00:01 openqa logrotate[11051]: reading config file /etc/logrotate.d/openqa Feb 13 01:00:01 openqa logrotate[11051]: warning: 'size' overrides previously specified 'hourly' Feb 13 01:00:01 openqa logrotate[11051]: compress_prog is now /usr/bin/xz Feb 13 01:00:01 openqa logrotate[11051]: compress_ext was changed to .xz Feb 13 01:00:01 openqa logrotate[11051]: uncompress_prog is now /usr/bin/xzdec Feb 13 01:00:01 openqa logrotate[11051]: warning: 'size' overrides previously specified 'hourly' Feb 13 01:00:01 openqa logrotate[11051]: compress_prog is now /usr/bin/xz Feb 13 01:00:01 openqa logrotate[11051]: compress_ext was changed to .xz Feb 13 01:00:01 openqa logrotate[11051]: uncompress_prog is now /usr/bin/xzdec Feb 13 01:00:01 openqa logrotate[11051]: reading config file /etc/logrotate.d/openqa-apache Feb 13 01:00:01 openqa logrotate[11051]: warning: 'size' overrides previously specified 'hourly' Feb 13 01:00:01 openqa logrotate[11051]: compress_prog is now /usr/bin/xz Feb 13 01:00:01 openqa logrotate[11051]: compress_ext was changed to .xz Feb 13 01:00:01 openqa logrotate[11051]: uncompress_prog is now /usr/bin/xzdec Feb 13 01:00:01 openqa logrotate[11051]: error: state file /var/lib/misc/logrotate.status is already locked Feb 13 01:00:01 openqa logrotate[11051]: logrotate does not support parallel execution on the same set of logfiles. Feb 13 01:00:01 openqa systemd[1]: logrotate-openqa.service: Main process exited, code=exited, status=3/NOTIMPLEMENTED Feb 13 01:00:01 openqa systemd[1]: logrotate-openqa.service: Failed with result 'exit-code'. Feb 13 01:00:01 openqa systemd[1]: Failed to start Rotate openQA log files. Feb 13 02:00:01 openqa systemd[1]: Starting Rotate openQA log files... Feb 13 02:00:01 openqa logrotate[12997]: reading config file /etc/logrotate.d/openqa
The log for logrotate on the Pi is unfortunately empty. So there's likely not much we can do about the Pi at this point.
I haven't paused the alerts. Let's see whether this is happening again.
Acceptance criteria¶
- AC1: The pi worker no longer triggers the failed systemd services alert
- AC2: OSD no longer triggers the failed systemd services alert
Note that the issue on OSD was a one-time issue and is at this point no concern anymore. It happened again on OSD as well:
Feb 26 00:00:03 openqa systemd[1]: Finished Rotate openQA log files. Feb 26 01:00:00 openqa systemd[1]: Starting Rotate openQA log files... Feb 26 01:00:00 openqa logrotate[16932]: reading config file /etc/logrotate.d/openqa Feb 26 01:00:00 openqa logrotate[16932]: warning: 'size' overrides previously specified 'hourly' Feb 26 01:00:00 openqa logrotate[16932]: compress_prog is now /usr/bin/xz Feb 26 01:00:00 openqa logrotate[16932]: compress_ext was changed to .xz Feb 26 01:00:00 openqa logrotate[16932]: uncompress_prog is now /usr/bin/xzdec Feb 26 01:00:00 openqa logrotate[16932]: warning: 'size' overrides previously specified 'hourly' Feb 26 01:00:00 openqa logrotate[16932]: compress_prog is now /usr/bin/xz Feb 26 01:00:00 openqa logrotate[16932]: compress_ext was changed to .xz Feb 26 01:00:00 openqa logrotate[16932]: uncompress_prog is now /usr/bin/xzdec Feb 26 01:00:00 openqa logrotate[16932]: reading config file /etc/logrotate.d/openqa-apache Feb 26 01:00:00 openqa logrotate[16932]: warning: 'size' overrides previously specified 'hourly' Feb 26 01:00:00 openqa logrotate[16932]: compress_prog is now /usr/bin/xz Feb 26 01:00:00 openqa logrotate[16932]: compress_ext was changed to .xz Feb 26 01:00:00 openqa logrotate[16932]: uncompress_prog is now /usr/bin/xzdec Feb 26 01:00:00 openqa logrotate[16932]: error: state file /var/lib/misc/logrotate.status is already locked Feb 26 01:00:00 openqa logrotate[16932]: logrotate does not support parallel execution on the same set of logfiles. Feb 26 01:00:00 openqa systemd[1]: logrotate-openqa.service: Main process exited, code=exited, status=3/NOTIMPLEMENTED Feb 26 01:00:00 openqa systemd[1]: logrotate-openqa.service: Failed with result 'exit-code'. Feb 26 01:00:00 openqa systemd[1]: Failed to start Rotate openQA log files. Feb 26 02:00:00 openqa systemd[1]: Starting Rotate openQA log files...
Suggestions¶
- Look into log files on osd for details (as above)
- Check what happens if logrotate is called manually. It's rather safe to call logrotate again but might trigger the above
- Likely an unclean shutdown caused this. Try to trigger manually.
- Ask dheidler to fix the Pi-worker
Related issues
History
#7
Updated by mkittler 3 months ago
It can't be that logrotate.service
runs into a timeout because the service it oneshot
and the docs say:
Defaults to DefaultTimeoutStartSec= set in the manager, except when Type=oneshot is used, in which case the timeout is disabled by default (see systemd-system.conf(5)).
Since we don't have a persistent journal on the pi-worker it is hard to tell why the service failed.
dheidler So this actually is about logrotate.service
. I confused it with systemd-journal-flush.service
/systemd-journald.service
last time we discussed this. This also means we likely shouldn't just ignore it in our alerting.
#8
Updated by mkittler 3 months ago
Annoyingly, this problem is actually still happening - on the piworker and OSD: https://stats.openqa-monitor.qa.suse.de/d/KToPYLEWz/failed-systemd-services?orgId=1&from=1677358457916&to=1677476057916
#9
Updated by mkittler 3 months ago
- Description updated (diff)
I'm wondering about the difference between logrotate-openqa.service
(that one fails on OSD) and logrotate.service
(that one fails on the pi-worker). On OSD we have actually both services running:
martchus@openqa:~> sudo systemctl status logrotate.timer logrotate-openqa.timer ● logrotate.timer - Daily rotation of log files Loaded: loaded (/usr/lib/systemd/system/logrotate.timer; enabled; vendor preset: enabled) Active: active (waiting) since Sun 2023-02-26 03:32:33 CET; 1 day 10h ago Trigger: Tue 2023-02-28 00:00:00 CET; 9h left Triggers: ● logrotate.service Docs: man:logrotate(8) man:logrotate.conf(5) Feb 26 03:32:33 openqa systemd[1]: Started Daily rotation of log files. ● logrotate-openqa.timer - Hourly rotation of openqa log files Loaded: loaded (/etc/systemd/system/logrotate-openqa.timer; enabled; vendor preset: disabled) Active: active (waiting) since Sun 2023-02-26 03:32:33 CET; 1 day 10h ago Trigger: Mon 2023-02-27 15:00:00 CET; 43min left Triggers: ● logrotate-openqa.service Docs: man:logrotate(8) man:logrotate.conf(5) Feb 26 03:32:33 openqa systemd[1]: Started Hourly rotation of openqa log files.
Both services are triggered via timers. Is that wanted? Maybe it is the reason why logrotate complains about the parallel execution.
#10
Updated by mkittler 3 months ago
This PR might help with the issue we see on OSD: https://gitlab.suse.de/openqa/salt-states-openqa/-/merge_requests/799
It looks like our salt states only mess with logrotate in the openqa.server
role. So the logrotate setup on the pi-worker is most likely the default setup of Leap 15.4. That means the problem on the pi-worker must be a different one as there are no conflicting services. Without persistent logs it is still impossible to pin-down, though.
If we wanted to ignore the logrotate service on the pi-worker we could use the --exclude
option of salt-states-openqa/monitoring/telegraf/scripts/systemd_list_service_by_state_for_telegraf.sh
added by a3050d28b5143a8d9052c04dbb946c3161f9bf1d
. We'd only needed to add a way to specify additional parameters for this script via pillars.
#14
Updated by okurz 3 months ago
- Has duplicate action #125126: [qe-core] Ensure that each SLE release request assigned with "qam-openqa" has at least a "fake" openQA test added
#15
Updated by okurz 3 months ago
- Has duplicate deleted (action #125126: [qe-core] Ensure that each SLE release request assigned with "qam-openqa" has at least a "fake" openQA test)
#16
Updated by okurz 3 months ago
- Related to action #125132: [alert] logrotate failed on OSD added