Project

General

Profile

Actions

action #124412

closed

[alert] logrotate services failed on openqa-piworker.qa.suse.de and OSD size:M

Added by mkittler about 1 year ago. Updated about 1 year ago.

Status:
Resolved
Priority:
High
Assignee:
Category:
-
Target version:
Start date:
2023-02-13
Due date:
% Done:

0%

Estimated time:
Tags:

Description

Observation

See https://stats.openqa-monitor.qa.suse.de/d/KToPYLEWz/failed-systemd-services?orgId=1&from=1676150994873&to=1676257956411 for the time frame. (It also shows some mount units but they were only failing very shortly.)

This is the log of logratate-openqa on OSD:

Feb 13 00:00:03 openqa logrotate[30440]:   Now: 2023-02-13 00:00
Feb 13 00:00:03 openqa logrotate[30440]:   Last rotated at 2023-02-12 05:00
Feb 13 00:00:03 openqa logrotate[30440]:   log does not need rotating (log size is below the 'size' threshold)
Feb 13 00:00:03 openqa systemd[1]: logrotate-openqa.service: Deactivated successfully.
Feb 13 00:00:03 openqa systemd[1]: Finished Rotate openQA log files.
Feb 13 01:00:00 openqa systemd[1]: Starting Rotate openQA log files...
Feb 13 01:00:01 openqa logrotate[11051]: reading config file /etc/logrotate.d/openqa
Feb 13 01:00:01 openqa logrotate[11051]: warning: 'size' overrides previously specified 'hourly'
Feb 13 01:00:01 openqa logrotate[11051]: compress_prog is now /usr/bin/xz
Feb 13 01:00:01 openqa logrotate[11051]: compress_ext was changed to .xz
Feb 13 01:00:01 openqa logrotate[11051]: uncompress_prog is now /usr/bin/xzdec
Feb 13 01:00:01 openqa logrotate[11051]: warning: 'size' overrides previously specified 'hourly'
Feb 13 01:00:01 openqa logrotate[11051]: compress_prog is now /usr/bin/xz
Feb 13 01:00:01 openqa logrotate[11051]: compress_ext was changed to .xz
Feb 13 01:00:01 openqa logrotate[11051]: uncompress_prog is now /usr/bin/xzdec
Feb 13 01:00:01 openqa logrotate[11051]: reading config file /etc/logrotate.d/openqa-apache
Feb 13 01:00:01 openqa logrotate[11051]: warning: 'size' overrides previously specified 'hourly'
Feb 13 01:00:01 openqa logrotate[11051]: compress_prog is now /usr/bin/xz
Feb 13 01:00:01 openqa logrotate[11051]: compress_ext was changed to .xz
Feb 13 01:00:01 openqa logrotate[11051]: uncompress_prog is now /usr/bin/xzdec
Feb 13 01:00:01 openqa logrotate[11051]: error: state file /var/lib/misc/logrotate.status is already locked
Feb 13 01:00:01 openqa logrotate[11051]: logrotate does not support parallel execution on the same set of logfiles.
Feb 13 01:00:01 openqa systemd[1]: logrotate-openqa.service: Main process exited, code=exited, status=3/NOTIMPLEMENTED
Feb 13 01:00:01 openqa systemd[1]: logrotate-openqa.service: Failed with result 'exit-code'.
Feb 13 01:00:01 openqa systemd[1]: Failed to start Rotate openQA log files.
Feb 13 02:00:01 openqa systemd[1]: Starting Rotate openQA log files...
Feb 13 02:00:01 openqa logrotate[12997]: reading config file /etc/logrotate.d/openqa

The log for logrotate on the Pi is unfortunately empty. So there's likely not much we can do about the Pi at this point.

I haven't paused the alerts. Let's see whether this is happening again.

Acceptance criteria

  • AC1: The pi worker no longer triggers the failed systemd services alert
  • AC2: OSD no longer triggers the failed systemd services alert

Note that the issue on OSD was a one-time issue and is at this point no concern anymore. It happened again on OSD as well:

Feb 26 00:00:03 openqa systemd[1]: Finished Rotate openQA log files.
Feb 26 01:00:00 openqa systemd[1]: Starting Rotate openQA log files...
Feb 26 01:00:00 openqa logrotate[16932]: reading config file /etc/logrotate.d/openqa
Feb 26 01:00:00 openqa logrotate[16932]: warning: 'size' overrides previously specified 'hourly'
Feb 26 01:00:00 openqa logrotate[16932]: compress_prog is now /usr/bin/xz
Feb 26 01:00:00 openqa logrotate[16932]: compress_ext was changed to .xz
Feb 26 01:00:00 openqa logrotate[16932]: uncompress_prog is now /usr/bin/xzdec
Feb 26 01:00:00 openqa logrotate[16932]: warning: 'size' overrides previously specified 'hourly'
Feb 26 01:00:00 openqa logrotate[16932]: compress_prog is now /usr/bin/xz
Feb 26 01:00:00 openqa logrotate[16932]: compress_ext was changed to .xz
Feb 26 01:00:00 openqa logrotate[16932]: uncompress_prog is now /usr/bin/xzdec
Feb 26 01:00:00 openqa logrotate[16932]: reading config file /etc/logrotate.d/openqa-apache
Feb 26 01:00:00 openqa logrotate[16932]: warning: 'size' overrides previously specified 'hourly'
Feb 26 01:00:00 openqa logrotate[16932]: compress_prog is now /usr/bin/xz
Feb 26 01:00:00 openqa logrotate[16932]: compress_ext was changed to .xz
Feb 26 01:00:00 openqa logrotate[16932]: uncompress_prog is now /usr/bin/xzdec
Feb 26 01:00:00 openqa logrotate[16932]: error: state file /var/lib/misc/logrotate.status is already locked
Feb 26 01:00:00 openqa logrotate[16932]: logrotate does not support parallel execution on the same set of logfiles.
Feb 26 01:00:00 openqa systemd[1]: logrotate-openqa.service: Main process exited, code=exited, status=3/NOTIMPLEMENTED
Feb 26 01:00:00 openqa systemd[1]: logrotate-openqa.service: Failed with result 'exit-code'.
Feb 26 01:00:00 openqa systemd[1]: Failed to start Rotate openQA log files.
Feb 26 02:00:00 openqa systemd[1]: Starting Rotate openQA log files...

Suggestions

  • Look into log files on osd for details (as above)
  • Check what happens if logrotate is called manually. It's rather safe to call logrotate again but might trigger the above
  • Likely an unclean shutdown caused this. Try to trigger manually.
  • Ask @dheidler to fix the Pi-worker

Related issues 1 (0 open1 closed)

Related to openQA Infrastructure - action #125132: [alert] logrotate failed on OSDResolvedmkittler2023-02-28

Actions
Actions #1

Updated by okurz about 1 year ago

  • Tags changed from alert to alert, infra
Actions #2

Updated by okurz about 1 year ago

  • Description updated (diff)
  • Status changed from New to Resolved
  • Assignee set to okurz

As discussed in unblock: Likely an unclean shutdown caused this and we don't need to handle that now. We will see if we have the issue popping up again.

Actions #3

Updated by jbaier_cz about 1 year ago

  • Status changed from Resolved to New
  • Assignee deleted (okurz)

It is happening again on openqa-piworker.qa.suse.de.

Actions #4

Updated by mkittler about 1 year ago

  • Subject changed from [alert] logrotate services failed on openqa-piworker.qa.suse.de and OSD to [alert] logrotate services failed on openqa-piworker.qa.suse.de and OSD size:M
  • Description updated (diff)
Actions #5

Updated by mkittler about 1 year ago

  • Status changed from New to Workable
Actions #6

Updated by mkittler about 1 year ago

  • Assignee set to mkittler
Actions #7

Updated by mkittler about 1 year ago

It can't be that logrotate.service runs into a timeout because the service it oneshot and the docs say:

Defaults to DefaultTimeoutStartSec= set in the manager, except when Type=oneshot is used, in which case the timeout is disabled by default (see systemd-system.conf(5)).

Since we don't have a persistent journal on the pi-worker it is hard to tell why the service failed.

@dheidler So this actually is about logrotate.service. I confused it with systemd-journal-flush.service/systemd-journald.service last time we discussed this. This also means we likely shouldn't just ignore it in our alerting.

Actions #8

Updated by mkittler about 1 year ago

Annoyingly, this problem is actually still happening - on the piworker and OSD: https://stats.openqa-monitor.qa.suse.de/d/KToPYLEWz/failed-systemd-services?orgId=1&from=1677358457916&to=1677476057916

Actions #9

Updated by mkittler about 1 year ago

  • Description updated (diff)

I'm wondering about the difference between logrotate-openqa.service (that one fails on OSD) and logrotate.service (that one fails on the pi-worker). On OSD we have actually both services running:

martchus@openqa:~> sudo systemctl status logrotate.timer logrotate-openqa.timer
● logrotate.timer - Daily rotation of log files
     Loaded: loaded (/usr/lib/systemd/system/logrotate.timer; enabled; vendor preset: enabled)
     Active: active (waiting) since Sun 2023-02-26 03:32:33 CET; 1 day 10h ago
    Trigger: Tue 2023-02-28 00:00:00 CET; 9h left
   Triggers: ● logrotate.service
       Docs: man:logrotate(8)
             man:logrotate.conf(5)

Feb 26 03:32:33 openqa systemd[1]: Started Daily rotation of log files.

● logrotate-openqa.timer - Hourly rotation of openqa log files
     Loaded: loaded (/etc/systemd/system/logrotate-openqa.timer; enabled; vendor preset: disabled)
     Active: active (waiting) since Sun 2023-02-26 03:32:33 CET; 1 day 10h ago
    Trigger: Mon 2023-02-27 15:00:00 CET; 43min left
   Triggers: ● logrotate-openqa.service
       Docs: man:logrotate(8)
             man:logrotate.conf(5)

Feb 26 03:32:33 openqa systemd[1]: Started Hourly rotation of openqa log files.

Both services are triggered via timers. Is that wanted? Maybe it is the reason why logrotate complains about the parallel execution.

Actions #10

Updated by mkittler about 1 year ago

This PR might help with the issue we see on OSD: https://gitlab.suse.de/openqa/salt-states-openqa/-/merge_requests/799


It looks like our salt states only mess with logrotate in the openqa.server role. So the logrotate setup on the pi-worker is most likely the default setup of Leap 15.4. That means the problem on the pi-worker must be a different one as there are no conflicting services. Without persistent logs it is still impossible to pin-down, though.

If we wanted to ignore the logrotate service on the pi-worker we could use the --exclude option of salt-states-openqa/monitoring/telegraf/scripts/systemd_list_service_by_state_for_telegraf.sh added by a3050d28b5143a8d9052c04dbb946c3161f9bf1d. We'd only needed to add a way to specify additional parameters for this script via pillars.

Actions #11

Updated by okurz about 1 year ago

  • Status changed from Workable to In Progress
Actions #12

Updated by okurz about 1 year ago

Regarding the journal on the piworker we have persistent log files for various services but not the system journal. We suggest to make the system journal persistent, at least for a limited time and monitor for subsequent failures.

Actions #13

Updated by mkittler about 1 year ago

  • Status changed from In Progress to Feedback

The MR for OSD has been deployed correctly. Let's see whether it worked.

I enabled a persistent journal on pi-worker to be able to further investigate the issue there.

Actions #14

Updated by okurz about 1 year ago

  • Has duplicate action #125126: [qe-core] Ensure that each SLE release request assigned with "qam-openqa" has at least a "fake" openQA test added
Actions #15

Updated by okurz about 1 year ago

  • Has duplicate deleted (action #125126: [qe-core] Ensure that each SLE release request assigned with "qam-openqa" has at least a "fake" openQA test)
Actions #16

Updated by okurz about 1 year ago

Actions #17

Updated by mkittler about 1 year ago

  • Status changed from Feedback to Resolved

I haven't seen the issue again anymore and our alerting would tell us. So I'm resolving this. When it happens next time we will have the persistent log for further investigation.

Actions #18

Updated by mkittler about 1 year ago

I've seen it on openqa-piworker.qa.suse.de again but despite persistent journal there's no trace of the problem because the journal doesn't go far enough. So I guess there's still not much more we can do about it right now.

Actions

Also available in: Atom PDF