action #175989

closed

Too big logfiles causing failed systemd services alert: logrotate (monitor, openqaw5-xen, s390zl12) size:S

Added by tinita about 1 month ago. Updated 24 days ago.

Status:
Resolved
Priority:
Normal
Assignee:
Category:
Regressions/Crashes
Start date:
2025-01-22
Due date:
% Done:

0%

Estimated time:

Description

Observation

Date: Wed, 22 Jan 2025 12:55:55 +0100

https://monitor.qa.suse.de/d/KToPYLEWz/failed-systemd-services?from=now-24h&to=now

2025-01-21 23:59:50 monitor logrotate
2025-01-21 23:59:50 openqaw5-xen logrotate
2025-01-21 23:59:50 s390zl12 logrotate

On monitor:

% journalctl -u logrotate --since yesterday
Jan 21 14:31:47 monitor systemd[1]: Starting Rotate log files...
Jan 21 14:31:51 monitor logrotate[3413]: error: destination /var/log/messages-20250121.xz already exists, skipping rotation
Jan 21 14:31:51 monitor logrotate[3413]: error: destination /var/log/zypper.log-20250121.xz already exists, skipping rotation
Jan 21 14:31:51 monitor systemd[1]: logrotate.service: Main process exited, code=exited, status=1/FAILURE
Jan 21 14:31:51 monitor systemd[1]: logrotate.service: Failed with result 'exit-code'.
Jan 21 14:31:51 monitor systemd[1]: Failed to start Rotate log files.
Jan 22 00:00:01 monitor systemd[1]: Starting Rotate log files...
Jan 22 00:07:49 monitor systemd[1]: logrotate.service: Deactivated successfully.
Jan 22 00:07:49 monitor systemd[1]: Finished Rotate log files.
Jan 22 00:07:49 monitor systemd[1]: logrotate.service: Consumed 7min 9.545s CPU time.

We have a lot of output in zypper.log and messages right now, so we should check why that is.

Acceptance criteria

  • AC1: We have an understanding of what triggered the problem on multiple hosts on that date
  • AC2: The error condition consistently does not trigger an alert

Suggestions

  • Look up existing tickets about logrotate. We already had "destination … already exists" multiple times
  • Investigate why the log messages became so big and fix the root cause
  • Consider how the underlying error condition could trigger an alert more directly (instead of indirectly relying on logrotate to "fail")
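
A minimal sketch of such a direct check, with paths and a size threshold chosen for illustration only (not an existing alert in our setup):

# list any log that has grown unusually large; a non-empty result could feed an alert directly
find /var/log/messages /var/log/zypper.log -size +100M -print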

Related issues 5 (1 open, 4 closed)

Related to openQA Project (public) - action #176121: salt-states-openqa pipeline deploy fails on master, SaltReqTimeoutError: Message timed out (Resolved, okurz, 2025-01-24)
Related to openQA Infrastructure (public) - action #175407: salt state for machine monitor.qe.nue2.suse.org was broken for almost 2 months, nothing was alerting us size:S (Resolved, okurz)
Related to openQA Infrastructure (public) - action #176250: file corruption in salt controlled config files size:M (Workable)
Related to openQA Infrastructure (public) - action #166313: Failed systemd services on multiple workers due to logrotate errors size:S (Resolved, livdywan)
Related to openQA Infrastructure (public) - action #69718: harmonize timezone used for our machines size:M (Resolved, dheidler, 2020-08-07)
Actions #1

Updated by tinita about 1 month ago

  • Subject changed from Failed systemd services alert: logroate (monitor, openqaw5-xen, s390zl12) to Failed systemd services alert: logrotate (monitor, openqaw5-xen, s390zl12)
Actions #2

Updated by tinita about 1 month ago

  • Description updated (diff)
Actions #3

Updated by okurz about 1 month ago

  • Tags set to infra, osd, logrotate, reactive work
Actions #4

Updated by okurz about 1 month ago

  • Target version changed from Ready to Tools - Next
Actions #5

Updated by okurz about 1 month ago

  • Subject changed from Failed systemd services alert: logrotate (monitor, openqaw5-xen, s390zl12) to Failed systemd services alert: logrotate (monitor, openqaw5-xen, s390zl12) size:S
  • Description updated (diff)
  • Status changed from New to Workable
Actions #6

Updated by okurz about 1 month ago

  • Target version changed from Tools - Next to Ready
Actions #7

Updated by jbaier_cz 28 days ago

  • Assignee set to jbaier_cz
Actions #8

Updated by nicksinger 27 days ago

  • Priority changed from High to Normal
Jan 21 14:31:51 monitor logrotate[3413]: error: destination /var/log/messages-20250121.xz already exists, skipping rotation
Jan 21 14:31:51 monitor logrotate[3413]: error: destination /var/log/zypper.log-20250121.xz already exists, skipping rotation

/etc/logrotate.d/zypper.lr and /etc/logrotate.d/syslog both set dateext to append YYYYMMDD to a rotated file instead of just increasing a number, and both use the size option to define when to rotate. This should not be a problem as the logrotate.timer runs just once a day, but for some reason it ran in the middle of the day:

Jan 21 14:31:47 monitor systemd[1]: Starting Rotate log files...
Jan 21 14:31:51 monitor logrotate[3413]: error: destination /var/log/messages-20250121.xz already exists, skipping rotation
Jan 21 14:31:51 monitor logrotate[3413]: error: destination /var/log/zypper.log-20250121.xz already exists, skipping rotation
Jan 21 14:31:51 monitor systemd[1]: logrotate.service: Main process exited, code=exited, status=1/FAILURE
Jan 21 14:31:51 monitor systemd[1]: logrotate.service: Failed with result 'exit-code'.
Jan 21 14:31:51 monitor systemd[1]: Failed to start Rotate log files.

Given that we don't lose logs and a false alert only happens if logrotate runs twice a day, I think we can lower the prio until @jbaier_cz is back to look into a proper solution.
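
For reference, when the timer is scheduled to fire can be checked with standard systemctl commands (nothing specific to this incident):

# show the last and next activation of the logrotate timer
systemctl list-timers logrotate.timer
# show the timer unit definition (OnCalendar, AccuracySec, ...)
systemctl cat logrotate.timer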

Actions #9

Updated by tinita 27 days ago

The suggestion was to look into the root cause - why were the logfiles so big on that day. Something was going on there.
Now if we leave that ticket lying around for a bit longer, those logs might be gone by then.

Actions #10

Updated by okurz 27 days ago

  • Priority changed from Normal to High

correct, hence high prio

Actions #11

Updated by dheidler 27 days ago

On that day on monitor:

/var/log/messages-20250121.xz has 2_356_264 lines.
2_224_564 are due to logging every single influxdb get request in that file.
74_892 lines contain this error:

2025-01-20T00:14:37.997040+01:00 monitor grafana[5917]: logger=ngalert.scheduler rule_uid=f2f03ad11bd031f195894a3375da3dfc639e1d40 org_id=1 version=3668 fingerprint=cc4558d00eab3708 now=2025-01-20T00:14:30+01:00 t=2025-01-20T00:14:37.996859595+01:00 level=error msg="Failed to build rule evaluator" error="failed to parse expression 'A': failed to unmarshal remarshaled classic condition body: json: cannot unmarshal number into Go struct field ConditionEvalJSON.evaluator.params of type []float64"

Then we have another 28_001 lines of general grafana logs and around the same number of mixed other entries. Notably, quite a few lines come from a service called "loki".
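
A minimal sketch of how such counts can be obtained from the rotated, compressed log (the grep pattern is just an example):

# total number of lines in the rotated log
xzcat /var/log/messages-20250121.xz | wc -l
# number of lines containing a specific error message
xzgrep -c 'Failed to build rule evaluator' /var/log/messages-20250121.xz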

Actions #12

Updated by dheidler 27 days ago

For the zypper logs, zypper.log-20250121.xz is of pretty average size (700 KB).
zypper.log-20250125.xz is much larger (8.7 MB).

According to zypper-log -l zypper.log-20250125.xz, half of the zypper calls look like this:

2025-01-24 23:59   9432  1.14.79  zypper --non-interactive --no-refresh info -t package nginx
2025-01-24 23:59   9504  1.14.79  zypper --non-interactive --no-refresh info -t package grafana
2025-01-25 00:00  10262  1.14.79  zypper --non-interactive --no-refresh info -t package loki
2025-01-25 00:00  10330  1.14.79  zypper --non-interactive --no-refresh info -t package influxdb
2025-01-25 00:01  10738  1.14.79  zypper --non-interactive --xmlout --no-refresh search --match-exact python3-augeas
2025-01-25 00:01  10826  1.14.79  zypper --non-interactive --no-refresh info -t package velociraptor-client
2025-01-25 00:01  10977  1.14.79  zypper --non-interactive --xmlout --no-refresh search --match-exact python3-pandas
2025-01-25 00:01  11014  1.14.79  zypper --non-interactive --xmlout --no-refresh search --match-exact python3-requests
2025-01-25 00:01  11044  1.14.79  zypper --non-interactive --xmlout --no-refresh search --match-exact python3-dataclasses
2025-01-25 00:01  11299  1.14.79  zypper --non-interactive --xmlout --no-refresh search --match-exact python3-requests
2025-01-25 00:01  11455  1.14.79  zypper --non-interactive --no-refresh info -t package nginx
2025-01-25 00:01  11503  1.14.79  zypper --non-interactive --no-refresh info -t package grafana
2025-01-25 00:02  12142  1.14.79  zypper --non-interactive --no-refresh info -t package loki
2025-01-25 00:02  12179  1.14.79  zypper --non-interactive --no-refresh info -t package influxdb
2025-01-25 00:02  12566  1.14.79  zypper --non-interactive --xmlout --no-refresh search --match-exact python3-augeas
2025-01-25 00:02  12654  1.14.79  zypper --non-interactive --no-refresh info -t package velociraptor-client
2025-01-25 00:03  12804  1.14.79  zypper --non-interactive --xmlout --no-refresh search --match-exact python3-pandas
2025-01-25 00:03  12842  1.14.79  zypper --non-interactive --xmlout --no-refresh search --match-exact python3-requests
2025-01-25 00:03  12902  1.14.79  zypper --non-interactive --xmlout --no-refresh search --match-exact python3-dataclasses
2025-01-25 00:03  12940  1.14.79  zypper --non-interactive --xmlout --no-refresh search --match-exact python3-requests
2025-01-25 00:03  13095  1.14.79  zypper --non-interactive --no-refresh info -t package nginx
2025-01-25 00:03  13137  1.14.79  zypper --non-interactive --no-refresh info -t package grafana
2025-01-25 00:03  13773  1.14.79  zypper --non-interactive --no-refresh info -t package loki
2025-01-25 00:03  13810  1.14.79  zypper --non-interactive --no-refresh info -t package influxdb
2025-01-25 00:04  14239  1.14.79  zypper --non-interactive --xmlout --no-refresh search --match-exact python3-augeas
2025-01-25 00:04  14327  1.14.79  zypper --non-interactive --no-refresh info -t package velociraptor-client
2025-01-25 00:04  14479  1.14.79  zypper --non-interactive --xmlout --no-refresh search --match-exact python3-pandas
2025-01-25 00:04  14517  1.14.79  zypper --non-interactive --xmlout --no-refresh search --match-exact python3-requests
2025-01-25 00:05  14578  1.14.79  zypper --non-interactive --xmlout --no-refresh search --match-exact python3-dataclasses
2025-01-25 00:05  14616  1.14.79  zypper --non-interactive --xmlout --no-refresh search --match-exact python3-requests
2025-01-25 00:05  14771  1.14.79  zypper --non-interactive --no-refresh info -t package nginx
2025-01-25 00:05  14848  1.14.79  zypper --non-interactive --no-refresh info -t package grafana

I would suspect that the calls come from salt.
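
A hypothetical salt state excerpt that would produce this pattern (the state ID and exact structure are assumptions, not copied from salt-states-openqa; package names taken from the log above) - salt's zypper module typically queries each package with "zypper --non-interactive --no-refresh info -t package <name>", so applying such a state in a loop repeats the whole set of queries every run:

# hypothetical SLS snippet
monitoring_packages:
  pkg.installed:
    - pkgs:
      - nginx
      - grafana
      - loki
      - influxdb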

Actions #13

Updated by jbaier_cz 27 days ago

  • Related to action #176121: salt-states-openqa pipeline deploy fails on master, SaltReqTimeoutError: Message timed out added
Actions #14

Updated by jbaier_cz 27 days ago

  • Related to action #175407: salt state for machine monitor.qe.nue2.suse.org was broken for almost 2 months, nothing was alerting us size:S added
Actions #15

Updated by okurz 26 days ago

  • Subject changed from Failed systemd services alert: logrotate (monitor, openqaw5-xen, s390zl12) size:S to Too big logfiles causing failed systemd services alert: logrotate (monitor, openqaw5-xen, s390zl12) size:S
Actions #16

Updated by tinita 26 days ago · Edited

At least the big zypper.log on monitor is likely related to #176250

salt.apply was called in a loop

Actions #17

Updated by tinita 26 days ago

  • Related to action #176250: file corruption in salt controlled config files size:M added
Actions #18

Updated by jbaier_cz 26 days ago

  • Status changed from Workable to In Progress
Actions #19

Updated by jbaier_cz 25 days ago

Logrotate configuration specifies

/var/log/zypper.log {
    compress
    dateext
    notifempty
    missingok
    nocreate

    maxage 60
    rotate 99
    size 10M
}

Due to repeated salt calls, the zypper log got filled with entries like the following.

2025-01-23 04:51:22 <1> monitor(15351) [zypper] main.cc(main):143 ===== 'zypper' '--non-interactive' '--no-refresh' 'info' '-t' 'package' 'nginx' =====
2025-01-23 04:51:28 <1> monitor(15396) [zypper] main.cc(main):143 ===== 'zypper' '--non-interactive' '--no-refresh' 'info' '-t' 'package' 'grafana' =====
2025-01-23 04:52:03 <1> monitor(16078) [zypper] main.cc(main):143 ===== 'zypper' '--non-interactive' '--no-refresh' 'info' '-t' 'package' 'loki' =====
2025-01-23 04:52:08 <1> monitor(16116) [zypper] main.cc(main):143 ===== 'zypper' '--non-interactive' '--no-refresh' 'info' '-t' 'package' 'influxdb' =====

The log grew larger than 10M multiple times per day, which in combination with dateext caused logrotate to fail. That answers AC1. Reasons for the salt calls are in the linked tickets.

Actions #20

Updated by jbaier_cz 25 days ago

  • Priority changed from High to Normal

No need for high priority anymore, as the related logs were already examined (and will be around for some time).

As for AC2, logrotate will only fail if the log was already rotated that day, and since there is no daily rotation for the zypper log, it will only fail if something writes more than 10 megabytes into the log in one day. I checked that the service will auto-recover the next day.

If we want logrotate to never fail, we need to either drop the date from the log name and use just numbers, or add minage 1 to the configuration, which should prevent rotating more than once in a single day. Unfortunately, both options need a tweak inside the upstream-provided config. I can create an appropriate recipe for our salt repo to do it.
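
A minimal sketch of the second option, based on the configuration quoted in #note-19 (only minage is added, everything else unchanged):

/var/log/zypper.log {
    compress
    dateext
    notifempty
    missingok
    nocreate

    # do not rotate a log that is less than one day old, so two logrotate
    # runs on the same day cannot collide on the dateext file name
    minage 1
    maxage 60
    rotate 99
    size 10M
}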

Actions #21

Updated by jbaier_cz 25 days ago

  • Status changed from In Progress to Feedback

I guess some kind of a crude solution (if we want it) could be https://gitlab.suse.de/openqa/salt-states-openqa/-/merge_requests/1366

Actions #22

Updated by tinita 25 days ago

  • Related to action #166313: Failed systemd services on multiple workers due to logrotate errors size:S added
Actions #23

Updated by tinita 25 days ago

Alternatively what I suggested in https://progress.opensuse.org/issues/166313#note-7
Somehow it was overlooked I guess

Actions #24

Updated by jbaier_cz 25 days ago

tinita wrote in #note-23:

Alternatively what I suggested in https://progress.opensuse.org/issues/166313#note-7
Somehow it was overlooked I guess

Yeah, having a normal daily rotation (with optional size-related exceptions) would also make sense to me.

Actions #25

Updated by jbaier_cz 24 days ago

  • Related to action #69718: harmonize timezone used for our machines size:M added
Actions #26

Updated by jbaier_cz 24 days ago

  • Status changed from Feedback to Resolved

I did think a little bit about this and I spotted another, completely different issue. The systemd timer is OnCalendar=daily, which should prevent the issue on its own (there should never be more than one rotation per day as the service does not run that often); together with AccuracySec=1h it should run somewhere in the first hour of the day and create a log archive for that particular day (because of dateext).
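
For reference, a sketch of a timer unit with the settings mentioned above (modeled on the stock logrotate.timer; Persistent=true is an assumption, not verified on the host):

[Unit]
Description=Daily rotation of log files

[Timer]
# fire once per calendar day, i.e. at midnight
OnCalendar=daily
# allow systemd to delay the activation by up to one hour
AccuracySec=1h
# assumption: catch up on activations missed while the machine was down
Persistent=true

[Install]
WantedBy=timers.target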

The log grew larger than 10M multiple times per day, which in combination with the dateext made the logrotate to fail.

That is not correct in this case.

Jan 21 14:31:47 monitor systemd[1]: Starting Rotate log files...

Something or someone started the service manually during the day, which in the case of logrotate is not a good idea. There are no logs from that time anymore, but #69718 was applied at exactly that time and changed the timezone for the machine, which might fiddle with the timers. This is most likely the cause of re-triggering the daily services.

Unless we are planning to change the timezone regularly, there is nothing more to do here.
