action #166739

closed

Consistent alerts for failed systemd services on o3 size:S

Added by livdywan 6 months ago. Updated 5 months ago.

Status: Resolved
Priority: Normal
Assignee:
Category: Feature requests
Start date: 2024-09-12
Due date: 2024-10-18
% Done: 0%
Estimated time:

Description

Motivation

There is no consistent monitoring of systemd services on o3. Most errors are ignored or only acted upon when there is a visible impact.

An example of this is the errors in openqa-continuous-update:

Sep 11 03:21:01 ariel openqa-continuous-update[9321]: /usr/share/openqa/script/openqa-check-devel-repo: line 39: echo: write error: Broken pipe
Sep 10 10:44:13 ariel openqa-continuous-update[26983]: Could not refresh the repositories because of errors.
Sep 10 10:44:13 ariel openqa-continuous-update[26983]: Skipping repository 'openQA' because of the above error.
Sep 10 10:39:12 ariel openqa-continuous-update[23326]: Could not refresh the repositories because of errors.
Sep 10 10:39:12 ariel openqa-continuous-update[23326]: Skipping repository 'openQA' because of the above error.
Sep 04 05:16:11 ariel openqa-continuous-update[8892]: /usr/share/openqa/script/openqa-check-devel-repo: line 39: echo: write error: Broken pipe
Sep 02 19:52:54 ariel openqa-continuous-update[21123]: /usr/share/openqa/script/openqa-check-devel-repo: line 39: echo: write error: Broken pipe
Sep 02 00:00:02 ariel openqa-continuous-update[19069]: Could not refresh the repositories because of errors.
Sep 02 00:00:02 ariel openqa-continuous-update[19069]: Skipping repository 'openQA' because of the above error.

My guess is nobody looked into those errors. I couldn't find relevant tickets or Slack conversations about those.

Suggestions

  • Use Munin's systemd_status plugin git
  • Look into the plugin if it allows us to get more details about the actually failed services. If not feasible leave it out. Don't implement your own :)
  • Research how systemd usually keeps a record of failures

Related issues 1 (0 open, 1 closed)

Copied from openQA Infrastructure (public) - action #166433: [alert] Waves of emails due to manual changes in /opt/openqa-trigger-from-obs size:S (Resolved, livdywan)

Actions #1

Updated by livdywan 6 months ago

  • Copied from action #166433: [alert] Waves of emails due to manual changes in /opt/openqa-trigger-from-obs size:S added
Actions #2

Updated by tinita 6 months ago

  • Description updated (diff)
Actions #3

Updated by tinita 6 months ago

  • Description updated (diff)
Actions #4

Updated by tinita 6 months ago

  • Status changed from New to In Progress
  • Assignee set to tinita

I copied systemd_status and systemd_units from https://github.com/munin-monitoring/contrib/tree/master/plugins/systemd
into /etc/munin/plugins and put the following into /etc/munin/plugin-conf.d/munin-node:

[systemd_units]
env.failed_critical 1

Only systemd_units is configurable regarding the critical value.
Let's monitor both graphs and see.
Enable the usual munin tunnel with ssh -L 8080:localhost:80 o3 and look at:
http://127.0.0.1:8080/munin/opensuse.org/openqa.opensuse.org/systemd_status.html
http://127.0.0.1:8080/munin/opensuse.org/openqa.opensuse.org/systemd_units.html

Actions #5

Updated by tinita 6 months ago

I think it will work like this:

We will get an email as soon as there is an average of >= 1.0 failed service in 5 minutes.
That means for timers that run every minute it must fail 5 times in a row within 5 minutes.

For timers that run every 10 or 15 minutes, it's enough to have one failure to get an email, so for sporadic issues that happen like once a day, we would get 2 emails - one for the critical state and one for the resolved state.

For timers that don't run every minute we could actually add a retry.
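
To make the averaging argument concrete, here is a minimal shell sketch. It takes the model from the comment above at face value (a mean over the samples in a 5-minute window is compared against 1.0), which is an assumption about Munin and not something verified here. The function name mean_failed and the sample series are made up for illustration.

```shell
#!/bin/sh
# Hypothetical helper (not part of Munin): mean of the "failed" samples
# passed as arguments, printed with two decimals.
mean_failed() {
    printf '%s\n' "$@" | awk '{ sum += $1 } END { if (NR) printf "%.2f\n", sum / NR }'
}

# A one-minute timer that failed once and recovered on the next run:
# only one of five samples is 1, so the mean stays below 1.0.
mean_failed 0 0 1 0 0
# prints 0.20

# A 15-minute timer that failed once stays in the failed state until its
# next run, so every sample in the window is 1 and the mean reaches 1.0.
mean_failed 1 1 1 1 1
# prints 1.00
```

Under that model, only the second series would cross the threshold and trigger an email.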

Actions #6

Updated by tinita 6 months ago

  • Status changed from In Progress to Feedback
Actions #7

Updated by okurz 6 months ago

  • Category set to Feature requests
  • Target version set to Ready
Actions #8

Updated by okurz 6 months ago

  • Due date set to 2024-09-26

Yesterday we got multiple alert emails but couldn't find out which systemd services had failed. The current setup is not really helpful because Munin only states that there are failed systemd services. These are yesterday's emails from https://mailman.suse.de/mlarch/SuSE/o3-admins/2024/o3-admins.2024.09/maillist.html:

Munin - processes systemd services - opensuse.org :: openqa.opensuse.org, 19:10:15, o3-admins
Munin - processes systemd services - opensuse.org :: openqa.opensuse.org, 19:05:16, o3-admins
Munin - processes systemd services - opensuse.org :: openqa.opensuse.org, 14:44:22, o3-admins
Munin - processes systemd services - opensuse.org :: openqa.opensuse.org, 14:44:22, o3-admins

Actions #9

Updated by okurz 6 months ago

  • Subject changed from Consistent alerts for failed systemd services on o3 to Consistent alerts for failed systemd services on o3 size:S
  • Description updated (diff)
Actions #10

Updated by tinita 5 months ago

Look into the plugin if it allows us to get more details about the actually failed services.

Munin plugins just deliver keys and number values.

Maybe there would be a way by calling a wrapper script on warnings/critical instead of directly sending out the email.

Actions #11

Updated by tinita 5 months ago

  • Status changed from Feedback to Workable
  • Assignee deleted (tinita)
Actions #12

Updated by livdywan 5 months ago

  • Status changed from Workable to In Progress
  • Assignee set to livdywan

tinita wrote in #note-10:

Look into the plugin if it allows us to get more details about the actually failed services.

Munin plugins just deliver keys and number values.

Maybe there would be a way by calling a wrapper script on warnings/critical instead of directly sending out the email.

Right. The Munin docs section "Alerts to or through external scripts" mentions overriding the command (as opposed to relying on e.g. contact.o3admins.command from /etc/munin/munin.conf on o3), which could send a custom email. And maybe this can still be a one-liner without supplying a script.

Actions #13

Updated by livdywan 5 months ago · Edited

  • Status changed from In Progress to Feedback

So adding a command to /etc/munin/plugin-conf.d/munin-node like so should probably work:

[systemd_units]
env.failed_critical 1
command systemctl list-units --failed | mail -s "Munin - ${var:graph_category} ${var:graph_title} - ${var:group} :: ${var:host}" -r "o3-admins@opensuse.org" o3-admins@opensuse.org

Edit: I guess it should just be command here.

Actions #14

Updated by livdywan 5 months ago

  • Due date changed from 2024-09-26 to 2024-10-04

So it seems as though the command is never used. I was also thinking maybe it's state-keeping interfering as seemingly it has to go back to 0 before it would pick up more services. Or maybe I don't understand how the plugin scans for failing services - I used systemd-run systemctl start foobar.service as a means to spawn non-existing services that immediately fail, and it definitely triggers the state (and I tried it a couple times on o3 and this did result in the regular emails being sent out).

Looking at the code of the plugin I would even think we should see the failing services, not least because it says are displayed in order to quickly see which units are failing and why in the webui.

Actions #15

Updated by tinita 5 months ago

Have you tried setting env.silence_active_extinfo 1 then?
Then maybe it's visible here: http://127.0.0.1:8080/munin/opensuse.org/openqa.opensuse.org/systemd_units.html

Actions #16

Updated by tinita 5 months ago

livdywan wrote in #note-14:

So it seems as though the command is never used.

  • I believe you need to restart munin-node after such a config change.
  • I think it needs to be contact.o3admins.command in the global config. It can't be set per plugin.
Actions #17

Updated by tinita 5 months ago · Edited

Have you looked into the logfile?

/var/log/munin/munin-node.log
2024/09/26-17:00:04 [19324] Error output from systemd_units:
2024/09/26-17:00:04 [19324]     Failed to parse signal string ""Munin".
2024/09/26-17:00:04 [19324] Service 'systemd_units' exited with status 1/0.

Can you fix the config again?

edit: I fixed it and restarted munin.
Please always check the logs

Actions #18

Updated by livdywan 5 months ago

tinita wrote in #note-17:

Have you looked into the logfile?

/var/log/munin/munin-node.log
2024/09/26-17:00:04 [19324] Error output from systemd_units:
2024/09/26-17:00:04 [19324]     Failed to parse signal string ""Munin".
2024/09/26-17:00:04 [19324] Service 'systemd_units' exited with status 1/0.

Can you fix the config again?

I guess I was looking in the wrong place. The service never seemed to fail.

Actions #19

Updated by livdywan 5 months ago

  • Status changed from Feedback to In Progress

Next steps:

  • Investigate why config errors aren't reflected in service failures
  • Come up with a working syntax
Actions #20

Updated by livdywan 5 months ago

  • Status changed from In Progress to Feedback
  • Investigate why config errors aren't reflected in service failures

They are in my local container which I'm using for testing. I don't know why that's not the case on o3.

  • Come up with a working syntax

Apparently plugin commands are stripped of various characters, which isn't documented. After trying to rewrite the command I ended up putting the code in /usr/local/bin/mail-units. I think it works now (tested with my personal email).

Actions #21

Updated by livdywan 5 months ago

  • Status changed from Feedback to Resolved

I assume we're good here then.

Actions #22

Updated by tinita 5 months ago · Edited

  • Status changed from Resolved to Feedback

http://127.0.0.1:8080/munin/opensuse.org/openqa.opensuse.org/systemd_units.html is not updated anymore since yesterday 20:55.

To check whether a plugin is working there are two commands to run. The output should look like this:

# munin-run systemd_status config
graph_title systemd services
graph_vlabel Services
graph_category processes
graph_args --base 1000 --lower-limit 0
graph_scale no
graph_info Number of services in given activation state.
failed.label Services in failed state
dead.label Services in dead state
running.label Services in running state
exited.label Services in exited state
active.label Services in active state
listening.label Services in listening state
waiting.label Services in waiting state
plugged.label Services in plugged state
mounted.label Services in mounted state
failed.warning 0:0

# munin-run systemd_status 
failed.value 0
dead.value 0
running.value 47
exited.value 35
active.value 34
listening.value 4
waiting.value 26
plugged.value 51
mounted.value 26

For systemd_units I get no output:

# munin-run systemd_units config
# munin-run systemd_units 
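
A check like this could also be scripted. The following is only a sketch: the function name plugin_reports_values is hypothetical, and it assumes that a healthy plugin prints at least one "<field>.value <number>" line, as systemd_status does above.

```shell
#!/bin/sh
# Hypothetical helper: succeeds if the text on stdin contains at least one
# "<field>.value <something>" line, like the systemd_status output above.
plugin_reports_values() {
    grep -Eq '^[A-Za-z_][A-Za-z0-9_]*\.value[[:space:]]+[^[:space:]]'
}

# Intended use on the Munin node (not executed here):
#   munin-run systemd_units | plugin_reports_values \
#       || echo "systemd_units returned no values"
```

An empty fetch, as seen for systemd_units, makes the helper return a non-zero exit status.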

Also it would be good to state what exactly you changed.

Actions #23

Updated by tinita 5 months ago · Edited

What I see in plugin-conf.d/munin-node:

[systemd_units]
env.failed_critical 1
command /usr/local/bin/mail-units 

The command field for a plugin is telling munin what command to run for retrieving the data. That explains why it doesn't return anything I guess.
https://guide.munin-monitoring.org/en/latest/plugin/use.html

I think for the email we have to use the contact.o3admins.command in the global config.

Actions #24

Updated by tinita 5 months ago

Btw, to look at the past data you can do

rrdtool dump /var/lib/munin/opensuse.org/openqa.opensuse.org-systemd_units-failed-g.rrd
Actions #25

Updated by livdywan 5 months ago

Unfortunately I couldn't find a way to trigger commands in my test container. Not finding anything from reading https://guide.munin-monitoring.org/en/latest/tutorial/alert.html#alert-variables-example-usage and https://guide.munin-monitoring.org/en/latest/reference/munin.conf.html I tried to run munin-cron (what upstream docs call munin-update) unsuccessfully. Eventually I attempted to test it on o3 (after changing the mail address temporarily) assuming the command is receiving data via stdin but again couldn't easily see why the script wouldn't work.

Maybe someone else wants to give it a try? In particular 1) clarifying exactly how "command" is executed and 2) how to test the "command" analogous to munin-run.
Otherwise I'm starting to think I could have written a simple systemd service to send emails in less time than it is taking me to guess how to do this with munin.

Actions #26

Updated by tinita 5 months ago · Edited

To test the alert command without waiting for a status change, one can run

su - munin
/usr/lib/munin/munin-limits --always-send critical

For a script example see:
https://guide.munin-monitoring.org/en/latest/tutorial/alert.html#alerts-to-or-through-external-scripts
https://guide.munin-monitoring.org/en/latest/tutorial/alert.html#reformatting-the-output-message

(I saw the docs mentioned munin-limits. I could not find it in the normal path, so I did rpm -ql munin | grep limits. Then I did /usr/lib/munin/munin-limits --help to figure out how to run it)

Actions #27

Updated by tinita 5 months ago · Edited

I got something working with this:

% cat munin-mail
#!/usr/bin/perl
use strict;
use warnings;
use v5.10;

my ($subject, $email) = @ARGV;
my $content = '';
while (<STDIN>) {
    $content .= $_;
}
if ($subject =~ m/systemd_units/) {
    $content .= "\n" . qx{systemctl --failed};
}

open my $pipe, '|-', 'mail', '-s', $subject, '-r', $email, $email or die "Could not open pipe: $!";
print $pipe $content;
close $pipe or die "Could not close pipe: $?";

# munin.conf
contact.o3admins.command /home/tinita/munin-mail "${var:group} ${var:host} ${var:plugin} ${var:graph_category} '${var:graph_title}'" o3-admins@opensuse.org

tested with

/usr/lib/munin/munin-limits --always-send critical --config /home/tinita/munin-test.conf  --stdout --contact o3admins
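
For trying out the branching logic without sending mail, the body assembly can be mirrored in a few lines of shell. This is a sketch and not the script used on o3: the function name alert_body and the LIST_FAILED_CMD variable are hypothetical, and it assumes, as in the Perl script above, that Munin passes the alert text on stdin and the expanded subject as the first argument.

```shell
#!/bin/sh
# Hypothetical override hook: set LIST_FAILED_CMD to a harmless command to
# exercise the logic on a machine without systemd.
: "${LIST_FAILED_CMD:=systemctl --failed}"

# Print the mail body: the alert text from stdin, followed by the list of
# failed units when the subject (first argument) names the systemd_units plugin.
alert_body() {
    subject=$1
    cat
    case $subject in
        *systemd_units*)
            printf '\n'
            $LIST_FAILED_CMD   # intentionally unquoted so the command is word-split
            ;;
    esac
}

# Example dry run with a stubbed command (not executed here):
#   LIST_FAILED_CMD='echo fake-unit.service'
#   echo "CRITICALs: failed is 1.00" | alert_body "opensuse.org openqa.opensuse.org systemd_units"
```

Alerts from other plugins, for example openqa_minion_jobs, pass through unchanged because the subject does not match.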
Actions #28

Updated by livdywan 5 months ago

/usr/lib/munin/munin-limits --always-send critical --config /home/tinita/munin-test.conf  --stdout --contact o3admins

This gets me Can't open /var/log/munin/munin-limits.log (Permission denied) at /usr/lib/perl5/vendor_perl/5.26.1/Log/Log4perl/Appender/File.pm line 151. 😶

Actions #29

Updated by tinita 5 months ago

You need to run it as user munin

Actions #30

Updated by livdywan 5 months ago

tinita wrote in #note-29:

You need to run it as user munin

sudo -u munin /usr/lib/munin/munin-limits --always-send critical --stdout --contact o3admins works. Except it sends me openqa_minion_jobs instead of systemd_status.

I checked via config as well as munin-update.log and once more I stop seeing emails after it seemed to work once...

Actions #31

Updated by livdywan 5 months ago

  • Due date changed from 2024-10-04 to 2024-10-18

So I still can't be sure that this works. Hence bumping the due date.

Actions #32

Updated by tinita 5 months ago

livdywan wrote in #note-30:

tinita wrote in #note-29:

You need to run it as user munin

sudo -u munin /usr/lib/munin/munin-limits --always-send critical --stdout --contact o3admins works. Except it sends me openqa_minion_jobs instead of systemd_status.

That's because openqa_minion_jobs is currently critical. So this is expected.

Actions #33

Updated by tinita 5 months ago

How about we have a session together next week? I have the impression there are a lot of misunderstandings.

Actions #34

Updated by tinita 5 months ago

One reason I see is that according to

munin-run  systemd_units
...
failed.value 0

there is no failed service, but systemd_status reports one:

munin-run  systemd_status
...
failed.value 1

Compare http://127.0.0.1:8080/munin/opensuse.org/openqa.opensuse.org/systemd_units.html and http://127.0.0.1:8080/munin/opensuse.org/openqa.opensuse.org/systemd_status.html

So first you have to make sure that the plugin actually is critical (or warning, then adjust the --always-send option).

Was that plugin ever reporting a failed service?
According to

rrdtool dump /var/lib/munin/opensuse.org/openqa.opensuse.org-systemd_units-failed-g.rrd | less

It always reported 0 or NaN.
Maybe check how the plugin is retrieving its values.
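
To scan the dump for any sample above zero without paging through it, a filter along these lines may help. It assumes the usual rrdtool dump layout with one <row> per line and values inside <v> elements; the function name nonzero_rows is hypothetical.

```shell
#!/bin/sh
# Hypothetical helper: print only those <row> lines of an "rrdtool dump"
# (read from stdin) that contain at least one value greater than zero.
nonzero_rows() {
    awk '
        /<row>/ {
            line = $0
            # Walk over every <v>...</v> cell in the row.
            while (match(line, /<v>[^<]*<\/v>/)) {
                cell = substr(line, RSTART + 3, RLENGTH - 7)
                gsub(/[ \t]/, "", cell)
                # Skip NaN explicitly, then compare numerically.
                if (tolower(cell) != "nan" && cell + 0 > 0) { print $0; break }
                line = substr(line, RSTART + RLENGTH)
            }
        }'
}

# Intended use on o3 (not executed here):
#   rrdtool dump /var/lib/munin/opensuse.org/openqa.opensuse.org-systemd_units-failed-g.rrd | nonzero_rows
```

If the plugin never reported a failed service, the filter prints nothing.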

I will trigger a munin-limits run now to see if we get an email about the critical service now.

Actions #35

Updated by tinita 5 months ago

I got this email, hopefully you too:

Date: Fri, 04 Oct 2024 15:52:47 +0000
From: o3-admins@opensuse.org
To: o3-admins@opensuse.org
Subject: opensuse.org openqa.opensuse.org openqa_minion_jobs minion 'Minion Jobs - see https://openqa.opensuse.org/minion/jobs?state=failed'

opensuse.org :: openqa.opensuse.org :: Minion Jobs - see https://openqa.opensuse.org/minion/jobs?state=failed
        CRITICALs: failed is 655.00 (outside range [:500]).

So it's calling our script, and this works.

Actions #36

Updated by livdywan 5 months ago · Edited

Subject: opensuse.org openqa.opensuse.org openqa_minion_jobs minion 'Minion Jobs - see https://openqa.opensuse.org/minion/jobs?state=failed'

opensuse.org :: openqa.opensuse.org :: Minion Jobs - see https://openqa.opensuse.org/minion/jobs?state=failed
CRITICALs: failed is 655.00 (outside range [:500]).

So it's calling our script, and this works.

Sure. Except it's not picking up the failing systemd service:

sudo -u munin munin-run systemd_units config | grep ritical           
failed.critical 1
Actions #37

Updated by tinita 5 months ago

livdywan wrote in #note-36:

Sure. Except it's not picking up the failing systemd service:

sudo -u munin munin-run systemd_units config | grep ritical           
failed.critical 1

Well, that's the config output.
It tells munin which value of failed would make it critical.

The actual values you can get with

sudo munin-run systemd_units
...
failed.value 0
Actions #39

Updated by tinita 5 months ago

One problem seems to be that the service is displayed as loaded failed failed and the plugin only checks for the first one.

systemctl --no-pager --no-legend --all | grep run-r
● run-r3bd32618339d4da693f8aa9a4cb8cb48.service                                                                  loaded    failed   failed    /usr/bin/systemctl start abcdef.service

The command the plugin uses is

systemctl --no-pager --no-legend --all | awk '{print $1, $3}'

I'm not sure what the three columns mean exactly. The other plugin seems to detect it, so maybe we can use that one or both:

-if ($subject =~ m/systemd_units/) {
+if ($subject =~ m/systemd_(units|status)/) {
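
One possible explanation for the always-zero count, offered as an assumption rather than something confirmed in this ticket: systemctl marks the failed unit with a leading ●, which awk counts as a field of its own. For that line $1 and $3 are then the bullet and the LOAD column, not the unit name and the ACTIVE column. The sketch below reproduces this with the line quoted above and shows one way to compensate. The function name unit_and_active is hypothetical and this is not the plugin's code.

```shell
#!/bin/sh
# The failed unit as printed by systemctl in the comment above.
line='● run-r3bd32618339d4da693f8aa9a4cb8cb48.service loaded failed failed /usr/bin/systemctl start abcdef.service'

# What the plugin's awk expression sees: the bullet is field 1, LOAD is field 3.
printf '%s\n' "$line" | awk '{print $1, $3}'
# prints "● loaded"

# Hypothetical fix-up: skip a leading bullet field before picking UNIT and ACTIVE.
unit_and_active() {
    awk '{ if ($1 == "●") { print $2, $4 } else { print $1, $3 } }'
}

printf '%s\n' "$line" | unit_and_active
# prints "run-r3bd32618339d4da693f8aa9a4cb8cb48.service failed"
```

Passing --plain to systemctl, which omits the bullet, might be a simpler alternative, but that was not tried in this ticket.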

Looking at the munin emails from Sep 15 I also see that those came from "systemd services", and that's the title for the systemd_status plugin.

Subject: Munin - processes systemd services - opensuse.org :: openqa.opensuse.org
Actions #40

Updated by tinita 5 months ago

I made that change to the script in /usr/local/bin/munin-mail.
We probably would have to wait until the service is ok and then failed again to get an email.
Munin only sends emails when the status changes. Unless you configure it to always send, e.g.

contact.o3admins.always_send warning critical

I will change that now to see if we get an email and then comment it out again.

Actions #41

Updated by tinita 5 months ago

Ok, I edited munin.conf, restarted munin-node.service, and we got two emails.

Subject: opensuse.org openqa.opensuse.org systemd_status processes 'systemd services'

opensuse.org :: openqa.opensuse.org :: systemd services
        WARNINGs: Services in failed state is 1.00 (outside range [0:0]).


  UNIT                                          LOAD   ACTIVE SUB    DESCRIPTION
● run-r3bd32618339d4da693f8aa9a4cb8cb48.service loaded failed failed /usr/bin/systemctl start abcdef.service

LOAD   = Reflects whether the unit definition was properly loaded.
ACTIVE = The high-level unit activation state, i.e. generalization of SUB.
SUB    = The low-level unit activation state, values depend on unit type.
1 loaded units listed.

That looks good, I'd say.
I commented it out again.

Actions #42

Updated by tinita 5 months ago

  • Assignee changed from livdywan to tinita

I stopped and disabled the test service and ran systemctl reset-failed.
We now got

Subject: opensuse.org openqa.opensuse.org systemd_status processes 'systemd services'

opensuse.org :: openqa.opensuse.org :: systemd services
        OKs: Services in failed state is 0.00.


  UNIT LOAD ACTIVE SUB DESCRIPTION
0 loaded units listed.

as an OK email.

What's left?
I think we should add the script to git.

Actions #43

Updated by tinita 5 months ago

Draft: https://github.com/os-autoinst/openQA/pull/5979 Add munin alert email wrapper

Actions #44

Updated by tinita 5 months ago

  • Status changed from Feedback to Resolved

https://github.com/os-autoinst/openQA/pull/5979 merged.

I changed /etc/munin/munin.conf to this:

contact.o3admins.command /usr/share/openqa/script/munin-mail "${var:group} ${var:host} ${var:plugin} ${var:graph_category} '${var:graph_title}'" o3-admins@opensuse.org

and deleted /usr/local/bin/munin-mail

Now waiting until it happens again.
