action #123151 (closed)

[Alerting] Failed systemd services alert

Added by jbaier_cz over 1 year ago. Updated about 1 year ago.

Status: Resolved
Priority: High
Assignee: mkittler
Category: -
Target version: -
Start date: 2023-01-16
Due date: -
% Done: 0%
Estimated time: -

Description

According to the monitoring, we have 6 failed services (since the reboot yesterday?), although the "currently failed" view is showing only systemd-journal-flush on host jenkins.


Related issues 1 (0 open, 1 closed)

Copied to openQA Infrastructure - action #123493: [Alerting] Failed systemd services alert (Resolved, mkittler, 2023-01-16)

Actions #1

Updated by mkittler over 1 year ago

  • Assignee set to mkittler
Actions #2

Updated by mkittler over 1 year ago

  • Status changed from New to Feedback

The service ran into a timeout (and was subsequently "killed"):

martchus@jenkins:~> sudo systemctl status systemd-journal-flush.service
× systemd-journal-flush.service - Flush Journal to Persistent Storage
     Loaded: loaded (/usr/lib/systemd/system/systemd-journal-flush.service; static)
     Active: failed (Result: timeout) since Sun 2023-01-15 03:36:29 CET; 1 day 13h ago
       Docs: man:systemd-journald.service(8)
             man:journald.conf(5)
    Process: 620 ExecStart=journalctl --flush (code=killed, signal=TERM)
   Main PID: 620 (code=killed, signal=TERM)

Jan 15 03:34:59 jenkins systemd[1]: Starting Flush Journal to Persistent Storage...
Jan 15 03:36:29 jenkins systemd[1]: systemd-journal-flush.service: start operation timed out. Terminating.
Jan 15 03:36:29 jenkins systemd[1]: systemd-journal-flush.service: Main process exited, code=killed, status=15/TERM
Jan 15 03:36:29 jenkins systemd[1]: systemd-journal-flush.service: Failed with result 'timeout'.
Jan 15 03:36:29 jenkins systemd[1]: Failed to start Flush Journal to Persistent Storage.

I restarted the service and it succeeded without delay.
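
For reference, a minimal sketch of the recovery steps (the ticket does not record the exact commands used):

sudo systemctl restart systemd-journal-flush.service        # re-run the flush; this time it completed immediately
sudo systemctl status systemd-journal-flush.service         # should now report "active (exited)" instead of "failed"
sudo systemctl reset-failed systemd-journal-flush.service   # alternative: only clear the failed state without re-running the flush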


It looks like the system was busy booting at the time, but otherwise there are no clues as to why the flushing was too slow:

Jan 15 03:35:01 jenkins systemd[1]: Mounted /tmp.
Jan 15 03:35:01 jenkins systemd[1]: Mounted /home.
Jan 15 03:35:01 jenkins systemd[1]: Mounted /srv.
Jan 15 03:35:01 jenkins systemd[1]: Mounted /opt.
Jan 15 03:35:01 jenkins systemd[1]: Reached target Local File Systems.
Jan 15 03:35:01 jenkins systemd[1]: Starting Restore /run/initramfs on shutdown...
Jan 15 03:35:01 jenkins systemd[1]: Condition check resulted in Set Up Additional Binary Formats being skipped.
Jan 15 03:35:01 jenkins systemd-journald[534]: Time spent on flushing to /var/log/journal/c3cd0595e630742b41710f0d59e8e4d7 is 13.831ms for 805 entries.
Jan 15 03:35:01 jenkins systemd-journald[534]: System Journal (/var/log/journal/c3cd0595e630742b41710f0d59e8e4d7) is 1.2G, max 4.0G, 2.7G free.
Jan 15 03:36:33 jenkins kernel: Adding 2103292k swap on /dev/sda1.  Priority:-2 extents:1 across:2103292k FS
Jan 15 03:35:02 jenkins systemd[1]: check-for-kernel-crash.service: Deactivated successfully.
Jan 15 03:35:02 jenkins systemd[1]: Finished Fail if at least one kernel crash has been recorded under /var/crash.
Jan 15 03:35:02 jenkins systemd[1]: Finished Restore /run/initramfs on shutdown.
Jan 15 03:35:09 jenkins systemd[1]: Condition check resulted in QEMU_HARDDISK 1 being skipped.
Jan 15 03:35:09 jenkins systemd[1]: Activating swap /dev/disk/by-uuid/4b7ce8a7-f32b-419e-9b4a-5eddce8691a8...
Jan 15 03:35:13 jenkins systemd[1]: Activated swap /dev/disk/by-uuid/4b7ce8a7-f32b-419e-9b4a-5eddce8691a8.
Jan 15 03:35:13 jenkins systemd[1]: Reached target Swaps.
Jan 15 03:35:13 jenkins systemd[1]: Started RPC Bind.
Jan 15 03:35:41 jenkins systemd[1]: Finished Load/Save Random Seed.
Jan 15 03:35:41 jenkins systemd[1]: Condition check resulted in First Boot Complete being skipped.
Jan 15 03:35:41 jenkins systemd[1]: Condition check resulted in Store a System Token in an EFI Variable being skipped.
Jan 15 03:35:41 jenkins systemd[1]: Condition check resulted in Commit a transient machine-id on disk being skipped.
Jan 15 03:36:29 jenkins systemd[1]: systemd-journal-flush.service: start operation timed out. Terminating.
Jan 15 03:36:29 jenkins systemd[1]: systemd-journal-flush.service: Main process exited, code=killed, status=15/TERM
Jan 15 03:36:29 jenkins systemd[1]: systemd-journal-flush.service: Failed with result 'timeout'.
Jan 15 03:36:29 jenkins systemd[1]: Failed to start Flush Journal to Persistent Storage.
Jan 15 03:36:29 jenkins systemd[1]: Starting Create Volatile Files and Directories...
Jan 15 03:36:29 jenkins systemd-tmpfiles[639]: /usr/lib/tmpfiles.d/brltty.conf:20: Failed to resolve group 'brltty'.
Jan 15 03:36:29 jenkins systemd-tmpfiles[639]: /usr/lib/tmpfiles.d/brltty.conf:23: Failed to resolve group 'brltty'.
Jan 15 03:36:29 jenkins systemd-tmpfiles[639]: /usr/lib/tmpfiles.d/brltty.conf:26: Failed to resolve group 'brltty'.
Jan 15 03:36:29 jenkins systemd-tmpfiles[639]: /usr/lib/tmpfiles.d/systemd-resolve.conf:10: Duplicate line for path "/etc/resolv.conf", ignoring.
Jan 15 03:36:30 jenkins systemd[1]: Finished Create Volatile Files and Directories.
Jan 15 03:36:30 jenkins systemd[1]: Condition check resulted in Rebuild Journal Catalog being skipped.
Jan 15 03:36:30 jenkins systemd[1]: Condition check resulted in Update is Completed being skipped.
Jan 15 03:36:30 jenkins systemd[1]: Starting Record System Boot/Shutdown in UTMP...
Jan 15 03:36:30 jenkins systemd[1]: Finished Record System Boot/Shutdown in UTMP.
Jan 15 03:36:30 jenkins systemd[1]: Reached target System Initialization.
Jan 15 03:36:30 jenkins systemd[1]: Started Watch /etc/sysconfig/btrfsmaintenance.
Jan 15 03:36:30 jenkins systemd[1]: Started Watch for changes in CA certificates.
Jan 15 03:36:30 jenkins systemd[1]: Started CUPS Scheduler.
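
For reference, that window can be re-inspected with a time-bounded journal query, and the amount of data that had to be flushed can be checked as well (a sketch; timestamps taken from the excerpt above):

sudo journalctl --since "2023-01-15 03:34:00" --until "2023-01-15 03:37:00" --no-pager
sudo journalctl --disk-usage    # how much journal data exists on disk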

I'm not sure what we can/should do about it, considering it has only happened once so far.

Actions #3

Updated by jbaier_cz over 1 year ago

Were you able to identify the remaining 5 failed services? Or is the metric counting something else?

Actions #4

Updated by okurz over 1 year ago

If the journal flushing takes long then we could consider investigating why that is (maybe a process is too chatty in the logs), or increase the timeout, e.g. in a systemd override. Maybe we already do that for other roles in the salt repo?
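
A minimal sketch of such an override; the drop-in path follows the file later mentioned in this ticket, but the concrete timeout value is only an example, not something decided here:

# /etc/systemd/system/systemd-journal-flush.service.d/startup-timeout.conf
[Service]
# give the initial journal flush more time on slow or busy boots (example value)
TimeoutStartSec=10min

After placing the file, sudo systemctl daemon-reload makes the new timeout effective.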

Actions #5

Updated by mkittler over 1 year ago

Were you able to identify the remaining 5 failed services? Or is the metric counting something else?

No. I have no idea where the number "6" comes from. I resolved this one failing systemd service and the alert is now ok again.
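
For cross-checking such alerts against a host, the failed units can be listed directly (a sketch; the monitoring metric itself may aggregate differently, e.g. across hosts or over time):

systemctl --failed           # shorthand for: systemctl list-units --state=failed
systemctl is-system-running  # reports "degraded" as long as at least one unit is failed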

If the journal flushing takes long then we could consider investigating why that is

I tried but couldn't find many clues.

maybe a process is too chatty in the logs

This was still during the startup (considering other journal messages from around that time) and there wasn't a flood of messages.

or increase the timeout, e.g. in a systemd override. Maybe we already do that for other roles in the salt repo?

I'll check that although I'm not sure whether we should bump timeouts after it only happened once.

Actions #6

Updated by mkittler over 1 year ago

or increase the timeout, e.g. in a systemd override. Maybe we already do that for other roles in the salt repo?

Yes, here's an SR for that: https://gitlab.suse.de/openqa/salt-states-openqa/-/merge_requests/784

Actions #7

Updated by mkittler over 1 year ago

The alert was firing again, but this time for os-autoinst-openvswitch on worker11. The alert has been ok again for the past 2 hours because the unit was restarted and then worked again.

Jan 17 11:35:40 worker11 systemd[1]: Stopping os-autoinst openvswitch helper...
Jan 17 11:35:40 worker11 systemd[1]: os-autoinst-openvswitch.service: Deactivated successfully.
Jan 17 11:35:40 worker11 systemd[1]: Stopped os-autoinst openvswitch helper.
-- Boot 75629afb69ed4043930b14cdececdb7a --
Jan 17 11:40:43 worker11 systemd[1]: Starting os-autoinst openvswitch helper...
Jan 17 11:41:38 worker11 sh[2055]: br1             setup-in-progress
Jan 17 11:41:38 worker11 systemd[1]: os-autoinst-openvswitch.service: Control process exited, code=exited, status=162/n/a
Jan 17 11:41:38 worker11 systemd[1]: os-autoinst-openvswitch.service: Failed with result 'exit-code'.
Jan 17 11:41:38 worker11 systemd[1]: Failed to start os-autoinst openvswitch helper.
Jan 17 14:44:07 worker11 systemd[1]: Starting os-autoinst openvswitch helper...
Jan 17 14:44:08 worker11 systemd[1]: Started os-autoinst openvswitch helper.
Jan 17 14:44:41 worker11 ovs-vsctl[16694]: ovs|00001|vsctl|INFO|Called as ovs-vsctl set port tap2 tag=70 vlan_mode=dot1q-tunnel

EDIT: We actually have a ticket for this, see #120783.

Actions #8

Updated by mkittler over 1 year ago

I applied https://gitlab.suse.de/openqa/salt-states-openqa/-/merge_requests/784 and now /etc/systemd/system/systemd-journal-flush.service.d/startup-timeout.conf is present on all hosts.
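
A quick way to verify the drop-in is actually in effect (a sketch; the salt targeting is an assumption):

systemctl cat systemd-journal-flush.service                          # shows the unit file plus the startup-timeout.conf drop-in
systemctl show -p TimeoutStartUSec systemd-journal-flush.service     # prints the effective start timeout
sudo salt '*' cmd.run 'systemctl show -p TimeoutStartUSec systemd-journal-flush.service'   # same check across all minions (assumed targeting)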

Actions #9

Updated by okurz about 1 year ago

  • Tags changed from alert to alert, infra, reactive work
  • Status changed from Feedback to Resolved

MR merged. Discussed in weekly unblock 2023-01-18, resolved

Actions #10

Updated by jbaier_cz about 1 year ago

  • Copied to action #123493: [Alerting] Failed systemd services alert added