action #175989

closed

Too big logfiles causing failed systemd services alert: logrotate (monitor, openqaw5-xen, s390zl12) size:S

Added by tinita about 1 month ago. Updated 24 days ago.

Status:
Resolved
Priority:
Normal
Assignee:
Category:
Regressions/Crashes
Start date:
2025-01-22
Due date:
% Done:

0%

Estimated time:

Description

Observation

Date: Wed, 22 Jan 2025 12:55:55 +0100

https://monitor.qa.suse.de/d/KToPYLEWz/failed-systemd-services?from=now-24h&to=now

2025-01-21 23:59:50 monitor logrotate
2025-01-21 23:59:50 openqaw5-xen logrotate
2025-01-21 23:59:50 s390zl12 logrotate

On monitor:

% journalctl -u logrotate --since yesterday
Jan 21 14:31:47 monitor systemd[1]: Starting Rotate log files...
Jan 21 14:31:51 monitor logrotate[3413]: error: destination /var/log/messages-20250121.xz already exists, skipping rotation
Jan 21 14:31:51 monitor logrotate[3413]: error: destination /var/log/zypper.log-20250121.xz already exists, skipping rotation
Jan 21 14:31:51 monitor systemd[1]: logrotate.service: Main process exited, code=exited, status=1/FAILURE
Jan 21 14:31:51 monitor systemd[1]: logrotate.service: Failed with result 'exit-code'.
Jan 21 14:31:51 monitor systemd[1]: Failed to start Rotate log files.
Jan 22 00:00:01 monitor systemd[1]: Starting Rotate log files...
Jan 22 00:07:49 monitor systemd[1]: logrotate.service: Deactivated successfully.
Jan 22 00:07:49 monitor systemd[1]: Finished Rotate log files.
Jan 22 00:07:49 monitor systemd[1]: logrotate.service: Consumed 7min 9.545s CPU time.

We have a lot of output in zypper.log and messages right now, so we should check why that is.

Acceptance criteria

  • AC1: We have an understanding of what triggered the problem on multiple hosts on that date
  • AC2: The error condition consistently does not trigger an alert

Suggestions

  • Look up existing tickets about logrotate. We already had "destination … already exists" multiple times
  • Investigate why the log messages became so big and fix the root cause
  • Consider how the underlying error condition could trigger an alert more directly (instead of indirectly relying on logrotate to "fail")
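
A minimal sketch of such a direct check, with paths and a size threshold chosen for illustration only (not an existing alert in our setup):

# list any log that has grown unusually large; a non-empty result could feed an alert directly
find /var/log/messages /var/log/zypper.log -size +100M -print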

Related issues 5 (1 open, 4 closed)

Related to openQA Project (public) - action #176121: salt-states-openqa pipeline deploy fails on master, SaltReqTimeoutError: Message timed out (Resolved, okurz, 2025-01-24)
Related to openQA Infrastructure (public) - action #175407: salt state for machine monitor.qe.nue2.suse.org was broken for almost 2 months, nothing was alerting us size:S (Resolved, okurz)
Related to openQA Infrastructure (public) - action #176250: file corruption in salt controlled config files size:M (Workable)
Related to openQA Infrastructure (public) - action #166313: Failed systemd services on multiple workers due to logrotate errors size:S (Resolved, livdywan)
Related to openQA Infrastructure (public) - action #69718: harmonize timezone used for our machines size:M (Resolved, dheidler, 2020-08-07)
Actions #1

Updated by tinita about 1 month ago

  • Subject changed from Failed systemd services alert: logroate (monitor, openqaw5-xen, s390zl12) to Failed systemd services alert: logrotate (monitor, openqaw5-xen, s390zl12)
Actions #2

Updated by tinita about 1 month ago

  • Description updated (diff)
Actions #3

Updated by okurz about 1 month ago

  • Tags set to infra, osd, logrotate, reactive work
Actions #4

Updated by okurz about 1 month ago

  • Target version changed from Ready to Tools - Next
Actions #5

Updated by okurz about 1 month ago

  • Subject changed from Failed systemd services alert: logrotate (monitor, openqaw5-xen, s390zl12) to Failed systemd services alert: logrotate (monitor, openqaw5-xen, s390zl12) size:S
  • Description updated (diff)
  • Status changed from New to Workable
Actions #6

Updated by okurz about 1 month ago

  • Target version changed from Tools - Next to Ready
Actions #7

Updated by jbaier_cz 28 days ago

  • Assignee set to jbaier_cz
Actions #8

Updated by nicksinger 27 days ago

  • Priority changed from High to Normal
Jan 21 14:31:51 monitor logrotate[3413]: error: destination /var/log/messages-20250121.xz already exists, skipping rotation
Jan 21 14:31:51 monitor logrotate[3413]: error: destination /var/log/zypper.log-20250121.xz already exists, skipping rotation

/etc/logrotate.d/zypper.lr and /etc/logrotate.d/syslog both set dateext to append YYYYMMDD to a rotated file instead of just increasing a number, and both use the size option to define when to rotate. This should not be a problem as the logrotate.timer runs just once a day, but for some reason it ran in the middle of the day:

Jan 21 14:31:47 monitor systemd[1]: Starting Rotate log files...
Jan 21 14:31:51 monitor logrotate[3413]: error: destination /var/log/messages-20250121.xz already exists, skipping rotation
Jan 21 14:31:51 monitor logrotate[3413]: error: destination /var/log/zypper.log-20250121.xz already exists, skipping rotation
Jan 21 14:31:51 monitor systemd[1]: logrotate.service: Main process exited, code=exited, status=1/FAILURE
Jan 21 14:31:51 monitor systemd[1]: logrotate.service: Failed with result 'exit-code'.
Jan 21 14:31:51 monitor systemd[1]: Failed to start Rotate log files.

Given that we don't lose logs and a false alert only happens if logrotate runs twice a day, I think we can lower the prio until @jbaier_cz is back to look into a proper solution.
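
For reference, when the timer is scheduled to fire can be checked with standard systemctl commands (nothing specific to this incident):

# show the last and next activation of the logrotate timer
systemctl list-timers logrotate.timer
# show the timer unit definition (OnCalendar, AccuracySec, ...)
systemctl cat logrotate.timer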

Actions #9

Updated by tinita 27 days ago

The suggestion was to look into the root cause - why were the logfiles so big on that day. Something was going on there.
Now if we leave that ticket lying around for a bit longer, those logs might be gone by then.

Actions #10

Updated by okurz 27 days ago

  • Priority changed from Normal to High

correct, hence high prio

Actions #11

Updated by dheidler 27 days ago

On that day on monitor:

/var/log/messages-20250121.xz has 2_356_264 lines.
2_224_564 are due to logging every single influxdb get request in that file.
74_892 lines contain this error:

2025-01-20T00:14:37.997040+01:00 monitor grafana[5917]: logger=ngalert.scheduler rule_uid=f2f03ad11bd031f195894a3375da3dfc639e1d40 org_id=1 version=3668 fingerprint=cc4558d00eab3708 now=2025-01-20T00:14:30+01:00 t=2025-01-20T00:14:37.996859595+01:00 level=error msg="Failed to build rule evaluator" error="failed to parse expression 'A': failed to unmarshal remarshaled classic condition body: json: cannot unmarshal number into Go struct field ConditionEvalJSON.evaluator.params of type []float64"

Then we have another 28_001 lines of general grafana logs and around the same number of mixed other entries. Notably, quite a few lines come from a service called "loki".
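
A minimal sketch of how such counts can be obtained from the rotated, compressed log (the grep pattern is just an example):

# total number of lines in the rotated log
xzcat /var/log/messages-20250121.xz | wc -l
# number of lines containing a specific error message
xzgrep -c 'Failed to build rule evaluator' /var/log/messages-20250121.xz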

Actions #12

Updated by dheidler 27 days ago

For the zypper logs, zypper.log-20250121.xz is of pretty average size (700 KB).
zypper.log-20250125.xz is much larger (8.7 MB).

According to zypper-log -l zypper.log-20250125.xz, half of the zypper calls look like this:

2025-01-24 23:59   9432  1.14.79  zypper --non-interactive --no-refresh info -t package nginx
2025-01-24 23:59   9504  1.14.79  zypper --non-interactive --no-refresh info -t package grafana
2025-01-25 00:00  10262  1.14.79  zypper --non-interactive --no-refresh info -t package loki
2025-01-25 00:00  10330  1.14.79  zypper --non-interactive --no-refresh info -t package influxdb
2025-01-25 00:01  10738  1.14.79  zypper --non-interactive --xmlout --no-refresh search --match-exact python3-augeas
2025-01-25 00:01  10826  1.14.79  zypper --non-interactive --no-refresh info -t package velociraptor-client
2025-01-25 00:01  10977  1.14.79  zypper --non-interactive --xmlout --no-refresh search --match-exact python3-pandas
2025-01-25 00:01  11014  1.14.79  zypper --non-interactive --xmlout --no-refresh search --match-exact python3-requests
2025-01-25 00:01  11044  1.14.79  zypper --non-interactive --xmlout --no-refresh search --match-exact python3-dataclasses
2025-01-25 00:01  11299  1.14.79  zypper --non-interactive --xmlout --no-refresh search --match-exact python3-requests
2025-01-25 00:01  11455  1.14.79  zypper --non-interactive --no-refresh info -t package nginx
2025-01-25 00:01  11503  1.14.79  zypper --non-interactive --no-refresh info -t package grafana
2025-01-25 00:02  12142  1.14.79  zypper --non-interactive --no-refresh info -t package loki
2025-01-25 00:02  12179  1.14.79  zypper --non-interactive --no-refresh info -t package influxdb
2025-01-25 00:02  12566  1.14.79  zypper --non-interactive --xmlout --no-refresh search --match-exact python3-augeas
2025-01-25 00:02  12654  1.14.79  zypper --non-interactive --no-refresh info -t package velociraptor-client
2025-01-25 00:03  12804  1.14.79  zypper --non-interactive --xmlout --no-refresh search --match-exact python3-pandas
2025-01-25 00:03  12842  1.14.79  zypper --non-interactive --xmlout --no-refresh search --match-exact python3-requests
2025-01-25 00:03  12902  1.14.79  zypper --non-interactive --xmlout --no-refresh search --match-exact python3-dataclasses
2025-01-25 00:03  12940  1.14.79  zypper --non-interactive --xmlout --no-refresh search --match-exact python3-requests
2025-01-25 00:03  13095  1.14.79  zypper --non-interactive --no-refresh info -t package nginx
2025-01-25 00:03  13137  1.14.79  zypper --non-interactive --no-refresh info -t package grafana
2025-01-25 00:03  13773  1.14.79  zypper --non-interactive --no-refresh info -t package loki
2025-01-25 00:03  13810  1.14.79  zypper --non-interactive --no-refresh info -t package influxdb
2025-01-25 00:04  14239  1.14.79  zypper --non-interactive --xmlout --no-refresh search --match-exact python3-augeas
2025-01-25 00:04  14327  1.14.79  zypper --non-interactive --no-refresh info -t package velociraptor-client
2025-01-25 00:04  14479  1.14.79  zypper --non-interactive --xmlout --no-refresh search --match-exact python3-pandas
2025-01-25 00:04  14517  1.14.79  zypper --non-interactive --xmlout --no-refresh search --match-exact python3-requests
2025-01-25 00:05  14578  1.14.79  zypper --non-interactive --xmlout --no-refresh search --match-exact python3-dataclasses
2025-01-25 00:05  14616  1.14.79  zypper --non-interactive --xmlout --no-refresh search --match-exact python3-requests
2025-01-25 00:05  14771  1.14.79  zypper --non-interactive --no-refresh info -t package nginx
2025-01-25 00:05  14848  1.14.79  zypper --non-interactive --no-refresh info -t package grafana

I would suspect that the calls come from salt.
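
A hypothetical salt state excerpt that would produce this pattern (the state ID and exact structure are assumptions, not copied from salt-states-openqa; package names taken from the log above) - salt's zypper module typically queries each package with "zypper --non-interactive --no-refresh info -t package <name>", so applying such a state in a loop repeats the whole set of queries every run:

# hypothetical SLS snippet
monitoring_packages:
  pkg.installed:
    - pkgs:
      - nginx
      - grafana
      - loki
      - influxdb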

Actions #13

Updated by jbaier_cz 27 days ago

  • Related to action #176121: salt-states-openqa pipeline deploy fails on master, SaltReqTimeoutError: Message timed out added
Actions #14

Updated by jbaier_cz 27 days ago

  • Related to action #175407: salt state for machine monitor.qe.nue2.suse.org was broken for almost 2 months, nothing was alerting us size:S added
Actions #15

Updated by okurz 26 days ago

  • Subject changed from Failed systemd services alert: logrotate (monitor, openqaw5-xen, s390zl12) size:S to Too big logfiles causing failed systemd services alert: logrotate (monitor, openqaw5-xen, s390zl12) size:S
Actions #16

Updated by tinita 26 days ago · Edited

At least the big zypper.log on monitor is likely related to #176250

salt.apply was called in a loop

Actions #17

Updated by tinita 26 days ago

  • Related to action #176250: file corruption in salt controlled config files size:M added
Actions #18

Updated by jbaier_cz 26 days ago

  • Status changed from Workable to In Progress
Actions #19

Updated by jbaier_cz 25 days ago

Logrotate configuration specifies

/var/log/zypper.log {
    compress
    dateext
    notifempty
    missingok
    nocreate

    maxage 60
    rotate 99
    size 10M
}

Due to repeated salt calls, the zypper log got filled with entries like the following.

2025-01-23 04:51:22 <1> monitor(15351) [zypper] main.cc(main):143 ===== 'zypper' '--non-interactive' '--no-refresh' 'info' '-t' 'package' 'nginx' =====
2025-01-23 04:51:28 <1> monitor(15396) [zypper] main.cc(main):143 ===== 'zypper' '--non-interactive' '--no-refresh' 'info' '-t' 'package' 'grafana' =====
2025-01-23 04:52:03 <1> monitor(16078) [zypper] main.cc(main):143 ===== 'zypper' '--non-interactive' '--no-refresh' 'info' '-t' 'package' 'loki' =====
2025-01-23 04:52:08 <1> monitor(16116) [zypper] main.cc(main):143 ===== 'zypper' '--non-interactive' '--no-refresh' 'info' '-t' 'package' 'influxdb' =====

The log grew larger than 10M multiple times per day, which in combination with dateext caused logrotate to fail. That answers AC1. Reasons for the salt calls are in the linked tickets.

Actions #20

Updated by jbaier_cz 25 days ago

  • Priority changed from High to Normal

No need for high priority anymore, as the related logs were already examined (and will be around for some time).

As for AC2, logrotate will only fail if the log was already rotated that day, and since there is no daily rotation for the zypper log, it will only fail if something writes more than 10 megabytes into the log in one day. I checked that the service will auto-recover the next day.

If we want logrotate to never fail, we need to either drop the date from the log name and use just numbers, or add minage 1 to the configuration, which should prevent rotating more than once in a single day. Unfortunately, both options need a tweak inside the upstream-provided config. I can create an appropriate recipe for our salt repo to do it.
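
A minimal sketch of the second option, based on the configuration quoted in #note-19 (only minage is added, everything else unchanged):

/var/log/zypper.log {
    compress
    dateext
    notifempty
    missingok
    nocreate

    # do not rotate a log that is less than one day old, so two logrotate
    # runs on the same day cannot collide on the dateext file name
    minage 1
    maxage 60
    rotate 99
    size 10M
}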

Actions #21

Updated by jbaier_cz 25 days ago

  • Status changed from In Progress to Feedback

I guess some kind of a crude solution (if we want it) could be https://gitlab.suse.de/openqa/salt-states-openqa/-/merge_requests/1366

Actions #22

Updated by tinita 25 days ago

  • Related to action #166313: Failed systemd services on multiple workers due to logrotate errors size:S added
Actions #23

Updated by tinita 25 days ago

Alternatively what I suggested in https://progress.opensuse.org/issues/166313#note-7
Somehow it was overlooked I guess

Actions #24

Updated by jbaier_cz 25 days ago

tinita wrote in #note-23:

Alternatively what I suggested in https://progress.opensuse.org/issues/166313#note-7
Somehow it was overlooked I guess

Yeah, having a normal daily rotation (with optional size-related exceptions) would also make sense to me.

Actions #25

Updated by jbaier_cz 24 days ago

  • Related to action #69718: harmonize timezone used for our machines size:M added
Actions #26

Updated by jbaier_cz 24 days ago

  • Status changed from Feedback to Resolved

I did think a little bit about this and I spotted another, completely different issue. The systemd timer is OnCalendar=daily, which should prevent the issue on its own (there should never be more than one rotation per day as the service does not run that often); together with AccuracySec=1h it should run somewhere in the first hour of the day and create a log archive for that particular day (because of dateext).
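
For reference, a sketch of a timer unit with the settings mentioned above (modeled on the stock logrotate.timer; Persistent=true is an assumption, not verified on the host):

[Unit]
Description=Daily rotation of log files

[Timer]
# fire once per calendar day, i.e. at midnight
OnCalendar=daily
# allow systemd to delay the activation by up to one hour
AccuracySec=1h
# assumption: catch up on activations missed while the machine was down
Persistent=true

[Install]
WantedBy=timers.target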

The log grew larger than 10M multiple times per day, which in combination with the dateext made the logrotate to fail.

That is not correct in this case.

Jan 21 14:31:47 monitor systemd[1]: Starting Rotate log files...

Something or someone started the service manually during the day, which in the case of logrotate is not a good idea. There are no logs from that time anymore, but #69718 was applied at exactly that time and changed the timezone for the machine, which might fiddle with the timers. This is most likely the cause of re-triggering the daily services.

Unless we are planning to change the timezone regularly, there is nothing more to do here.
