action #175989

closed

Too big logfiles causing failed systemd services alert: logrotate (monitor, openqaw5-xen, s390zl12) size:S

Added by tinita about 1 month ago. Updated 24 days ago.

Status: Resolved
Priority: Normal
Assignee:
Category: Regressions/Crashes
Start date: 2025-01-22
Due date:
% Done: 0%
Estimated time:

Description

Observation

Date: Wed, 22 Jan 2025 12:55:55 +0100

https://monitor.qa.suse.de/d/KToPYLEWz/failed-systemd-services?from=now-24h&to=now

2025-01-21 23:59:50 monitor logrotate
2025-01-21 23:59:50 openqaw5-xen logrotate
2025-01-21 23:59:50 s390zl12 logrotate

On monitor:

% journalctl -u logrotate --since yesterday
Jan 21 14:31:47 monitor systemd[1]: Starting Rotate log files...
Jan 21 14:31:51 monitor logrotate[3413]: error: destination /var/log/messages-20250121.xz already exists, skipping rotation
Jan 21 14:31:51 monitor logrotate[3413]: error: destination /var/log/zypper.log-20250121.xz already exists, skipping rotation
Jan 21 14:31:51 monitor systemd[1]: logrotate.service: Main process exited, code=exited, status=1/FAILURE
Jan 21 14:31:51 monitor systemd[1]: logrotate.service: Failed with result 'exit-code'.
Jan 21 14:31:51 monitor systemd[1]: Failed to start Rotate log files.
Jan 22 00:00:01 monitor systemd[1]: Starting Rotate log files...
Jan 22 00:07:49 monitor systemd[1]: logrotate.service: Deactivated successfully.
Jan 22 00:07:49 monitor systemd[1]: Finished Rotate log files.
Jan 22 00:07:49 monitor systemd[1]: logrotate.service: Consumed 7min 9.545s CPU time.

We have a lot of output in zypper.log and messages right now, so we should check why.
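
The "already exists" errors in the journal excerpt above name the rotated files that block logrotate. As a minimal sketch, they could be filtered out of the journal like this. The function name `conflicting_destinations` is hypothetical, and the pattern assumes the exact message wording shown in the excerpt.

```shell
# Hypothetical helper: read journal lines on stdin and print only the
# destination paths that logrotate refused to overwrite.
# Assumes the message wording seen in the excerpt:
#   error: destination PATH already exists, skipping rotation
conflicting_destinations() {
    sed -n 's/.*error: destination \(.*\) already exists, skipping rotation$/\1/p'
}
```

Possible usage on an affected host: `journalctl -u logrotate --since yesterday | conflicting_destinations`.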

Acceptance criteria

  • AC1: We understand what triggered the problem on multiple hosts on that date
  • AC2: The error condition consistently does not trigger an alert

Suggestions

  • Look up existing tickets about logrotate. We already had "destination … already exists" multiple times
  • Investigate why the log messages became so big and fix the root cause
  • Consider how the underlying error condition could trigger an alert more directly (instead of indirectly relying on logrotate to "fail")
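
One way to alert on the underlying condition more directly might be a size check on the log directory itself, independent of whether logrotate fails. The following is only a sketch under assumptions: the function names and the threshold are hypothetical, it relies on the `k` size suffix supported by GNU find, and how the exit status would be wired into monitoring is left open.

```shell
# Hypothetical helper: list regular files below DIR that are larger
# than LIMIT_KIB kibibytes (uses the GNU find "k" size suffix).
# Usage: oversized_logs DIR LIMIT_KIB
oversized_logs() {
    find "$1" -type f -size +"$2"k
}

# Hypothetical check: return 0 if no file exceeds the limit; otherwise
# report the offending files on stderr and return 1, so that a
# monitoring wrapper could alert on the non-zero exit status.
# Usage: check_log_sizes DIR LIMIT_KIB
check_log_sizes() {
    big=$(oversized_logs "$1" "$2")
    if [ -z "$big" ]; then
        return 0
    fi
    printf '%s\n' "$big" | sed 's/^/oversized log: /' >&2
    return 1
}
```

For example, `check_log_sizes /var/log 1048576` would flag any file above 1 GiB; a suitable limit would have to be chosen per host.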

Related issues 5 (1 open, 4 closed)

  • Related to openQA Project (public) - action #176121: salt-states-openqa pipeline deploy fails on master, SaltReqTimeoutError: Message timed out (Resolved, okurz, 2025-01-24)
  • Related to openQA Infrastructure (public) - action #175407: salt state for machine monitor.qe.nue2.suse.org was broken for almost 2 months, nothing was alerting us size:S (Resolved, okurz)
  • Related to openQA Infrastructure (public) - action #176250: file corruption in salt controlled config files size:M (Workable)
  • Related to openQA Infrastructure (public) - action #166313: Failed systemd services on multiple workers due to logrotate errors size:S (Resolved, livdywan)
  • Related to openQA Infrastructure (public) - action #69718: harmonize timezone used for our machines size:M (Resolved, dheidler, 2020-08-07)
