action #175989
Too big logfiles causing failed systemd services alert: logrotate (monitor, openqaw5-xen, s390zl12) size:S
Status: closed
Added by tinita about 1 month ago. Updated 24 days ago.
Description
Observation
Date: Wed, 22 Jan 2025 12:55:55 +0100
https://monitor.qa.suse.de/d/KToPYLEWz/failed-systemd-services?from=now-24h&to=now
2025-01-21 23:59:50 monitor logrotate
2025-01-21 23:59:50 openqaw5-xen logrotate
2025-01-21 23:59:50 s390zl12 logrotate
On monitor:
% journalctl -u logrotate --since yesterday
Jan 21 14:31:47 monitor systemd[1]: Starting Rotate log files...
Jan 21 14:31:51 monitor logrotate[3413]: error: destination /var/log/messages-20250121.xz already exists, skipping rotation
Jan 21 14:31:51 monitor logrotate[3413]: error: destination /var/log/zypper.log-20250121.xz already exists, skipping rotation
Jan 21 14:31:51 monitor systemd[1]: logrotate.service: Main process exited, code=exited, status=1/FAILURE
Jan 21 14:31:51 monitor systemd[1]: logrotate.service: Failed with result 'exit-code'.
Jan 21 14:31:51 monitor systemd[1]: Failed to start Rotate log files.
Jan 22 00:00:01 monitor systemd[1]: Starting Rotate log files...
Jan 22 00:07:49 monitor systemd[1]: logrotate.service: Deactivated successfully.
Jan 22 00:07:49 monitor systemd[1]: Finished Rotate log files.
Jan 22 00:07:49 monitor systemd[1]: logrotate.service: Consumed 7min 9.545s CPU time.
We currently have a lot of output in zypper.log and /var/log/messages, so we should check why.
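A quick way to confirm which logs are oversized and which processes dominate them, as a rough sketch (the awk/sed pipeline assumes the usual "timestamp host tag" syslog line layout):

ls -lhS /var/log/messages* /var/log/zypper.log*
# top logging processes in a rotated archive, pid suffix stripped
xzcat /var/log/messages-20250121.xz | awk '{print $3}' | sed 's/\[.*//' | sort | uniq -c | sort -rn | head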
Acceptance criteria
- AC1: We have an understanding what triggered the problem on multiple hosts on that date
- AC2: The error condition consistently does not trigger an alert
Suggestions
- Look up existing tickets about logrotate. We already had "destination … already exists" multiple times
- Investigate why the log messages became so big and fix the root cause
- Consider how the underlying error condition could trigger an alert more directly (instead of indirectly relying on logrotate to "fail")
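One possible shape for such a direct check, as a sketch only (the 15M threshold and the file list are illustrative, not an agreed policy): a small script run from cron or a monitoring exec plugin that warns when a log outgrows its rotation threshold, independently of whether logrotate happens to fail on it.

#!/bin/sh
# Warn when a log grows suspiciously large before its daily rotation.
for f in /var/log/messages /var/log/zypper.log; do
    if [ "$(stat -c%s "$f")" -gt $((15 * 1024 * 1024)) ]; then
        echo "WARN: $f exceeds 15M before rotation"
    fi
done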
Updated by tinita about 1 month ago
- Subject changed from Failed systemd services alert: logroate (monitor, openqaw5-xen, s390zl12) to Failed systemd services alert: logrotate (monitor, openqaw5-xen, s390zl12)
Updated by okurz about 1 month ago
- Tags set to infra, osd, logrotate, reactive work
Updated by okurz about 1 month ago
- Target version changed from Ready to Tools - Next
Updated by okurz about 1 month ago
- Subject changed from Failed systemd services alert: logrotate (monitor, openqaw5-xen, s390zl12) to Failed systemd services alert: logrotate (monitor, openqaw5-xen, s390zl12) size:S
- Description updated (diff)
- Status changed from New to Workable
Updated by okurz about 1 month ago
- Target version changed from Tools - Next to Ready
Updated by nicksinger 27 days ago
- Priority changed from High to Normal
Jan 21 14:31:51 monitor logrotate[3413]: error: destination /var/log/messages-20250121.xz already exists, skipping rotation
Jan 21 14:31:51 monitor logrotate[3413]: error: destination /var/log/zypper.log-20250121.xz already exists, skipping rotation
Both /etc/logrotate.d/zypper.lr and /etc/logrotate.d/syslog use the dateext option to append YYYYMMDD to a rotated file instead of just incrementing a number, and the size option to define when to rotate. This should not be a problem, as logrotate.timer runs only once a day, but for some reason it ran in the middle of the day:
Jan 21 14:31:47 monitor systemd[1]: Starting Rotate log files...
Jan 21 14:31:51 monitor logrotate[3413]: error: destination /var/log/messages-20250121.xz already exists, skipping rotation
Jan 21 14:31:51 monitor logrotate[3413]: error: destination /var/log/zypper.log-20250121.xz already exists, skipping rotation
Jan 21 14:31:51 monitor systemd[1]: logrotate.service: Main process exited, code=exited, status=1/FAILURE
Jan 21 14:31:51 monitor systemd[1]: logrotate.service: Failed with result 'exit-code'.
Jan 21 14:31:51 monitor systemd[1]: Failed to start Rotate log files.
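The failure mode is easy to reproduce locally. A minimal sketch, assuming GNU logrotate; the /tmp paths and the config are made up for illustration, and the archive suffix depends on the configured compression (.gz by default, .xz on our hosts):

cat > /tmp/demo.lr <<'EOF'
/tmp/demo.log {
    compress
    dateext
    size 1k
}
EOF
echo data > /tmp/demo.log
logrotate -f -s /tmp/demo.state /tmp/demo.lr   # first rotation creates demo.log-YYYYMMDD.gz
echo data > /tmp/demo.log
logrotate -f -s /tmp/demo.state /tmp/demo.lr   # same day again: "destination ... already exists, skipping rotation"
echo $?                                        # non-zero, which is what makes the systemd unit fail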
Given that we don't lose logs, and a false alert only happens if logrotate runs twice a day, I think we can lower the prio until @jbaier_cz is back to look into a proper solution.
Updated by dheidler 27 days ago
On that day on monitor:
/var/log/messages-20250121.xz has 2_356_264 lines.
2_224_564 of them are due to logging every single influxdb GET request in that file.
74_892 lines contain this error:
2025-01-20T00:14:37.997040+01:00 monitor grafana[5917]: logger=ngalert.scheduler rule_uid=f2f03ad11bd031f195894a3375da3dfc639e1d40 org_id=1 version=3668 fingerprint=cc4558d00eab3708 now=2025-01-20T00:14:30+01:00 t=2025-01-20T00:14:37.996859595+01:00 level=error msg="Failed to build rule evaluator" error="failed to parse expression 'A': failed to unmarshal remarshaled classic condition body: json: cannot unmarshal number into Go struct field ConditionEvalJSON.evaluator.params of type []float64"
Then we have another 28_001 lines of general grafana logs and around the same number of mixed other entries. Notably, quite a few lines come from a service called "loki".
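For reference, counts like these can be obtained along these lines; the grep patterns here are simplified stand-ins for the exact message formats:

xzcat /var/log/messages-20250121.xz | wc -l                                # total lines
xzgrep -c 'influxdb'                       /var/log/messages-20250121.xz  # influxdb request logging
xzgrep -c 'Failed to build rule evaluator' /var/log/messages-20250121.xz  # the grafana scheduler error
xzgrep -c ' grafana\['                     /var/log/messages-20250121.xz  # general grafana logs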
Updated by dheidler 27 days ago
For the zypper logs, zypper.log-20250121.xz is of pretty average size (700 KB).
zypper.log-20250125.xz is much larger (8.7 MB).
According to zypper-log -l zypper.log-20250125.xz, half of the zypper calls look like this:
2025-01-24 23:59 9432 1.14.79 zypper --non-interactive --no-refresh info -t package nginx
2025-01-24 23:59 9504 1.14.79 zypper --non-interactive --no-refresh info -t package grafana
2025-01-25 00:00 10262 1.14.79 zypper --non-interactive --no-refresh info -t package loki
2025-01-25 00:00 10330 1.14.79 zypper --non-interactive --no-refresh info -t package influxdb
2025-01-25 00:01 10738 1.14.79 zypper --non-interactive --xmlout --no-refresh search --match-exact python3-augeas
2025-01-25 00:01 10826 1.14.79 zypper --non-interactive --no-refresh info -t package velociraptor-client
2025-01-25 00:01 10977 1.14.79 zypper --non-interactive --xmlout --no-refresh search --match-exact python3-pandas
2025-01-25 00:01 11014 1.14.79 zypper --non-interactive --xmlout --no-refresh search --match-exact python3-requests
2025-01-25 00:01 11044 1.14.79 zypper --non-interactive --xmlout --no-refresh search --match-exact python3-dataclasses
2025-01-25 00:01 11299 1.14.79 zypper --non-interactive --xmlout --no-refresh search --match-exact python3-requests
2025-01-25 00:01 11455 1.14.79 zypper --non-interactive --no-refresh info -t package nginx
2025-01-25 00:01 11503 1.14.79 zypper --non-interactive --no-refresh info -t package grafana
2025-01-25 00:02 12142 1.14.79 zypper --non-interactive --no-refresh info -t package loki
2025-01-25 00:02 12179 1.14.79 zypper --non-interactive --no-refresh info -t package influxdb
2025-01-25 00:02 12566 1.14.79 zypper --non-interactive --xmlout --no-refresh search --match-exact python3-augeas
2025-01-25 00:02 12654 1.14.79 zypper --non-interactive --no-refresh info -t package velociraptor-client
2025-01-25 00:03 12804 1.14.79 zypper --non-interactive --xmlout --no-refresh search --match-exact python3-pandas
2025-01-25 00:03 12842 1.14.79 zypper --non-interactive --xmlout --no-refresh search --match-exact python3-requests
2025-01-25 00:03 12902 1.14.79 zypper --non-interactive --xmlout --no-refresh search --match-exact python3-dataclasses
2025-01-25 00:03 12940 1.14.79 zypper --non-interactive --xmlout --no-refresh search --match-exact python3-requests
2025-01-25 00:03 13095 1.14.79 zypper --non-interactive --no-refresh info -t package nginx
2025-01-25 00:03 13137 1.14.79 zypper --non-interactive --no-refresh info -t package grafana
2025-01-25 00:03 13773 1.14.79 zypper --non-interactive --no-refresh info -t package loki
2025-01-25 00:03 13810 1.14.79 zypper --non-interactive --no-refresh info -t package influxdb
2025-01-25 00:04 14239 1.14.79 zypper --non-interactive --xmlout --no-refresh search --match-exact python3-augeas
2025-01-25 00:04 14327 1.14.79 zypper --non-interactive --no-refresh info -t package velociraptor-client
2025-01-25 00:04 14479 1.14.79 zypper --non-interactive --xmlout --no-refresh search --match-exact python3-pandas
2025-01-25 00:04 14517 1.14.79 zypper --non-interactive --xmlout --no-refresh search --match-exact python3-requests
2025-01-25 00:05 14578 1.14.79 zypper --non-interactive --xmlout --no-refresh search --match-exact python3-dataclasses
2025-01-25 00:05 14616 1.14.79 zypper --non-interactive --xmlout --no-refresh search --match-exact python3-requests
2025-01-25 00:05 14771 1.14.79 zypper --non-interactive --no-refresh info -t package nginx
2025-01-25 00:05 14848 1.14.79 zypper --non-interactive --no-refresh info -t package grafana
I would suspect that the calls come from salt.
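One way to substantiate that suspicion, as a sketch (the paths assume a standard salt layout; /srv/salt lives on the salt master, not necessarily on monitor itself):

# correlate the zypper calls with minion activity in the same time window
journalctl -u salt-minion --since '2025-01-24 23:55' --until '2025-01-25 00:10'
# look for states referencing the packages seen above
grep -rn 'nginx\|grafana\|loki\|influxdb' /srv/salt --include='*.sls' | head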
Updated by jbaier_cz 27 days ago
- Related to action #176121: salt-states-openqa pipeline deploy fails on master, SaltReqTimeoutError: Message timed out added
Updated by jbaier_cz 27 days ago
- Related to action #175407: salt state for machine monitor.qe.nue2.suse.org was broken for almost 2 months, nothing was alerting us size:S added
Updated by tinita 26 days ago
- Related to action #176250: file corruption in salt controlled config files size:M added
Updated by jbaier_cz 25 days ago
Logrotate configuration specifies:
/var/log/zypper.log {
    compress
    dateext
    notifempty
    missingok
    nocreate
    maxage 60
    rotate 99
    size 10M
}
Due to repeated salt calls, the zypper log got filled with entries like the following:
2025-01-23 04:51:22 <1> monitor(15351) [zypper] main.cc(main):143 ===== 'zypper' '--non-interactive' '--no-refresh' 'info' '-t' 'package' 'nginx' =====
2025-01-23 04:51:28 <1> monitor(15396) [zypper] main.cc(main):143 ===== 'zypper' '--non-interactive' '--no-refresh' 'info' '-t' 'package' 'grafana' =====
2025-01-23 04:52:03 <1> monitor(16078) [zypper] main.cc(main):143 ===== 'zypper' '--non-interactive' '--no-refresh' 'info' '-t' 'package' 'loki' =====
2025-01-23 04:52:08 <1> monitor(16116) [zypper] main.cc(main):143 ===== 'zypper' '--non-interactive' '--no-refresh' 'info' '-t' 'package' 'influxdb' =====
The log grew larger than 10M multiple times per day, which in combination with dateext made logrotate fail. That answers AC1. The reasons for the salt calls are covered in the linked tickets.
Updated by jbaier_cz 25 days ago
- Priority changed from High to Normal
No need for high priority anymore; the related logs were already examined (and will be around for some time).
As for AC2, logrotate will only fail if the log was already rotated that day, and since there is no daily rotation for the zypper log, that only happens if something writes more than 10 megabytes into the log within one day. I checked that the service will auto-recover the next day.
If we want logrotate to never fail, we need to either drop the date from the rotated filenames and use just numbers, or add minage 1 to the configuration, which should prevent rotating more than once in a single day. Unfortunately, both options need a tweak inside the upstream-provided config. I can create an appropriate recipe for our salt repo to do it; see the sketch below.
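For illustration, the resulting config with minage added could look like this; whether the salt recipe patches the file in place or replaces it entirely is an open design choice:

/var/log/zypper.log {
    compress
    dateext
    minage 1
    notifempty
    missingok
    nocreate
    maxage 60
    rotate 99
    size 10M
}

A dry run with logrotate -d /etc/logrotate.d/zypper.lr verifies that the tweaked config still parses.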
Updated by jbaier_cz 25 days ago
- Status changed from In Progress to Feedback
I guess some kind of a crude solution (if we want it) could be https://gitlab.suse.de/openqa/salt-states-openqa/-/merge_requests/1366
Updated by tinita 25 days ago
- Related to action #166313: Failed systemd services on multiple workers due to logrotate errors size:S added
Updated by tinita 25 days ago
Alternatively, there is what I suggested in https://progress.opensuse.org/issues/166313#note-7. Somehow it was overlooked, I guess.
Updated by jbaier_cz 25 days ago
tinita wrote in #note-23:
Alternatively, there is what I suggested in https://progress.opensuse.org/issues/166313#note-7. Somehow it was overlooked, I guess.
Yeah, having a normal daily rotation (with optional size-related exceptions) would also make sense to me.
Updated by jbaier_cz 24 days ago
- Related to action #69718: harmonize timezone used for our machines size:M added
Updated by jbaier_cz 24 days ago
- Status changed from Feedback to Resolved
I thought a little bit about this and spotted another, completely different issue. The systemd timer uses OnCalendar=daily, which should prevent the issue on its own (there should never be more than one rotation per day, as the service does not run that often); together with AccuracySec=1h it should run somewhere within the first hour of the day and create a log archive for that particular day (because of dateext).
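The timer settings can be double-checked like this (a sketch; the commented unit content reflects typical openSUSE defaults):

systemctl cat logrotate.timer
# [Timer]
# OnCalendar=daily
# AccuracySec=1h
# Persistent=true
systemctl list-timers logrotate.timer   # shows the next and last trigger times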
The log grew larger than 10M multiple times per day, which in combination with dateext made logrotate fail.
That is not correct in this case.
Jan 21 14:31:47 monitor systemd[1]: Starting Rotate log files...
Something or someone started the service manually during the day, which in the case of logrotate is not a good idea. There are no logs from that time anymore, but #69718 was applied at exactly that time and changed the timezone of the machine, which might fiddle with the timers. This is most likely what re-triggered the daily services.
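Had the journal still covered that window, a check like the following could have confirmed both the unexpected trigger and the timezone change (a sketch; systemd-timedated logs a "Changed time zone to ..." line when the timezone is switched):

journalctl -u logrotate.timer   --since '2025-01-21 14:00' --until '2025-01-21 15:00'
journalctl -u systemd-timedated --since '2025-01-21 14:00' --until '2025-01-21 15:00'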
Unless we are planning to change the timezone regularly, there is nothing more to do here.