action #166739
closed
Consistent alerts for failed systemd services on o3 size:S
Added by livdywan 6 months ago. Updated 5 months ago.
0%
Description
Motivation
There is no consistent monitoring of systemd services on o3. Most errors are ignored or only acted upon when there is a visible impact.
An example of this is the errors in openqa-continuous-update:
Sep 11 03:21:01 ariel openqa-continuous-update[9321]: /usr/share/openqa/script/openqa-check-devel-repo: line 39: echo: write error: Broken pipe
Sep 10 10:44:13 ariel openqa-continuous-update[26983]: Could not refresh the repositories because of errors.
Sep 10 10:44:13 ariel openqa-continuous-update[26983]: Skipping repository 'openQA' because of the above error.
Sep 10 10:39:12 ariel openqa-continuous-update[23326]: Could not refresh the repositories because of errors.
Sep 10 10:39:12 ariel openqa-continuous-update[23326]: Skipping repository 'openQA' because of the above error.
Sep 04 05:16:11 ariel openqa-continuous-update[8892]: /usr/share/openqa/script/openqa-check-devel-repo: line 39: echo: write error: Broken pipe
Sep 02 19:52:54 ariel openqa-continuous-update[21123]: /usr/share/openqa/script/openqa-check-devel-repo: line 39: echo: write error: Broken pipe
Sep 02 00:00:02 ariel openqa-continuous-update[19069]: Could not refresh the repositories because of errors.
Sep 02 00:00:02 ariel openqa-continuous-update[19069]: Skipping repository 'openQA' because of the above error.
My guess is nobody looked into those errors. I couldn't find relevant tickets or Slack conversations about those.
Suggestions
- Use Munin's systemd_status plugin git
- Look into the plugin if it allows us to get more details about the actually failed services. If not feasible leave it out. Don't implement your own :)
- Research how systemd usually keeps a record of failures
Updated by livdywan 6 months ago
- Copied from action #166433: [alert] Waves of emails due to manual changes in /opt/openqa-trigger-from-obs size:S added
Updated by tinita 6 months ago
- Status changed from New to In Progress
- Assignee set to tinita
I copied systemd_status and systemd_units from https://github.com/munin-monitoring/contrib/tree/master/plugins/systemd into /etc/munin/plugins and put the following into /etc/munin/plugin-conf.d/munin-node:
[systemd_units]
env.failed_critical 1
Only systemd_units is configurable regarding the critical value.
Let's monitor both graphs and see.
Enable the usual munin tunnel ssh -L 8080:localhost:80 o3 and look at:
http://127.0.0.1:8080/munin/opensuse.org/openqa.opensuse.org/systemd_status.html
http://127.0.0.1:8080/munin/opensuse.org/openqa.opensuse.org/systemd_units.html
Updated by tinita 6 months ago
I think it will work like this:
We will get an email as soon as there is an average of >= 1.0 failed service in 5 minutes.
That means for timers that run every minute it must fail 5 times in a row within 5 minutes.
For timers that run every 10 or 15 minutes, it's enough to have one failure to get an email, so for sporadic issues that happen like once a day, we would get 2 emails - one for the critical state and one for the resolved state.
For timers that don't run every minute we could actually add a retry.
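A minimal sketch of the averaging assumption above, in shell. It assumes munin compares the mean of the per-minute failed counts in a 5-minute window against the critical value of 1. That model is the guess from this comment, not something confirmed from the munin source. The function names window_average and would_alert are made up for illustration.

```shell
#!/bin/sh
# Average a series of "failed" samples, one number per argument.
# Assumption: one sample per minute, five samples per alert window.
window_average() {
    printf '%s\n' "$@" | awk '{ sum += $1 } END { printf "%.2f\n", sum / NR }'
}

# Exit status 0 means "would alert" under the assumed rule (mean >= 1.0).
would_alert() {
    avg=$(window_average "$@")
    # Let awk do the floating-point comparison.
    awk -v a="$avg" 'BEGIN { exit !(a >= 1.0) }'
}
```

Under this model, would_alert 1 1 1 1 1 succeeds, while a window with one recovered sample such as would_alert 1 0 1 1 1 does not. A unit triggered every 10 or 15 minutes stays in the failed state for the whole window after a single failure, which is why one failure would be enough there.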
Updated by okurz 6 months ago
- Due date set to 2024-09-26
Yesterday we got multiple alert emails but couldn't find out which systemd services had failed. The current approach is not really helpful because munin only states that there are failed systemd services. These are yesterday's emails from https://mailman.suse.de/mlarch/SuSE/o3-admins/2024/o3-admins.2024.09/maillist.html
Munin - processes systemd services - opensuse.org :: openqa.opensuse.org, 19:10:15, o3-admins
Munin - processes systemd services - opensuse.org :: openqa.opensuse.org, 19:05:16, o3-admins
Munin - processes systemd services - opensuse.org :: openqa.opensuse.org, 14:44:22, o3-admins
Munin - processes systemd services - opensuse.org :: openqa.opensuse.org, 14:44:22, o3-admins
Updated by livdywan 5 months ago
- Status changed from Workable to In Progress
- Assignee set to livdywan
tinita wrote in #note-10:
Look into the plugin if it allows us to get more details about the actually failed services.
Munin plugins just deliver keys and number values.
Maybe there would be a way by calling a wrapper script on warnings/critical instead of directly sending out the email.
Right. "Alerts to or through external scripts" mentions overriding the command (as opposed to relying on e.g. contact.o3admins.command from /etc/munin/munin.conf on o3) which could send a custom email. And maybe this can still be a one-liner without supplying a script.
Updated by livdywan 5 months ago · Edited
- Status changed from In Progress to Feedback
So adding a command to /etc/munin/plugin-conf.d/munin-node like so should probably work:
[systemd_units]
env.failed_critical 1
command systemctl list-units --failed | mail -s "Munin - ${var:graph_category} ${var:graph_title} - ${var:group} :: ${var:host}" -r "o3-admins@opensuse.org" o3-admins@opensuse.org
Edit: I guess it should just be command here.
Updated by livdywan 5 months ago
- Due date changed from 2024-09-26 to 2024-10-04
So it seems as though the command is never used. I was also thinking that maybe state-keeping is interfering, since seemingly the count has to go back to 0 before more services are picked up. Or maybe I don't understand how the plugin scans for failing services. I used systemd-run systemctl start foobar.service as a means to spawn non-existing services that immediately fail, and it definitely triggers the state (I tried it a couple of times on o3 and this did result in the regular emails being sent out).
Looking at the code of the plugin I would even think we should see the failing services, not least because it says they "are displayed in order to quickly see which units are failing and why" in the webui.
Updated by tinita 5 months ago
Have you tried setting env.silence_active_extinfo 1 then?
Then maybe it's visible here: http://127.0.0.1:8080/munin/opensuse.org/openqa.opensuse.org/systemd_units.html
Updated by tinita 5 months ago · Edited
Have you looked into the logfile?
/var/log/munin/munin-node.log
2024/09/26-17:00:04 [19324] Error output from systemd_units:
2024/09/26-17:00:04 [19324] Failed to parse signal string ""Munin".
2024/09/26-17:00:04 [19324] Service 'systemd_units' exited with status 1/0.
Can you fix the config again?
edit: I fixed it and restarted munin.
Please always check the logs
Updated by livdywan 5 months ago
tinita wrote in #note-17:
Have you looked into the logfile?
/var/log/munin/munin-node.log
2024/09/26-17:00:04 [19324] Error output from systemd_units:
2024/09/26-17:00:04 [19324] Failed to parse signal string ""Munin".
2024/09/26-17:00:04 [19324] Service 'systemd_units' exited with status 1/0.
Can you fix the config again?
I guess I was looking in the wrong place. The service never seemed to fail.
Updated by livdywan 5 months ago
- Status changed from In Progress to Feedback
- Investigate why config errors aren't reflected in service failures
They are in my local container which I'm using for testing. I don't know why that's not the case on o3.
- Come up with a working syntax
Apparently various characters are stripped from plugin commands, which isn't documented. After trying to rewrite the command I ended up putting the code in /usr/local/bin/mail-units. I think it works now (tested with my personal email).
Updated by tinita 5 months ago · Edited
- Status changed from Resolved to Feedback
http://127.0.0.1:8080/munin/opensuse.org/openqa.opensuse.org/systemd_units.html is not updated anymore since yesterday 20:55.
To check whether a plugin is working there are two commands to run. The output should look like this:
# munin-run systemd_status config
graph_title systemd services
graph_vlabel Services
graph_category processes
graph_args --base 1000 --lower-limit 0
graph_scale no
graph_info Number of services in given activation state.
failed.label Services in failed state
dead.label Services in dead state
running.label Services in running state
exited.label Services in exited state
active.label Services in active state
listening.label Services in listening state
waiting.label Services in waiting state
plugged.label Services in plugged state
mounted.label Services in mounted state
failed.warning 0:0
# munin-run systemd_status
failed.value 0
dead.value 0
running.value 47
exited.value 35
active.value 34
listening.value 4
waiting.value 26
plugged.value 51
mounted.value 26
For systemd_units I get no output:
# munin-run systemd_units config
# munin-run systemd_units
Also it would be good to state what exactly you changed.
Updated by tinita 5 months ago · Edited
What I see in plugin-conf.d/munin-node:
[systemd_units]
env.failed_critical 1
command /usr/local/bin/mail-units
The command field for a plugin tells munin what command to run to retrieve the data. That explains why it doesn't return anything, I guess.
https://guide.munin-monitoring.org/en/latest/plugin/use.html
I think for the email we have to use the contact.o3admins.command in the global config.
Updated by livdywan 5 months ago
Unfortunately I couldn't find a way to trigger commands in my test container. Not finding anything from reading https://guide.munin-monitoring.org/en/latest/tutorial/alert.html#alert-variables-example-usage and https://guide.munin-monitoring.org/en/latest/reference/munin.conf.html I tried to run munin-cron (what upstream docs call munin-update) unsuccessfully. Eventually I attempted to test it on o3 (after changing the mail address temporarily) assuming the command is receiving data via stdin but again couldn't easily see why the script wouldn't work.
Maybe someone else wants to give it a try? In particular 1) clarifying exactly how "command" is executed and 2) how to test the "command" analogously to munin-run.
Otherwise I'm starting to think I could have written a simple systemd service to send emails in less time than it is taking me to guess how to do this with munin.
Updated by tinita 5 months ago · Edited
To test the alert command without waiting for a status change, one can run
su - munin
/usr/lib/munin/munin-limits --always-send critical
For a script example see:
https://guide.munin-monitoring.org/en/latest/tutorial/alert.html#alerts-to-or-through-external-scripts
https://guide.munin-monitoring.org/en/latest/tutorial/alert.html#reformatting-the-output-message
(I saw that the docs mentioned munin-limits. I could not find it in the normal path, so I did rpm -ql munin | grep limits. Then I did /usr/lib/munin/munin-limits --help to figure out how to run it.)
Updated by tinita 5 months ago · Edited
I got something working with this:
% cat munin-mail
#!/usr/bin/perl
use strict;
use warnings;
use v5.10;

# Arguments: the mail subject and the address used as both sender and recipient
my ($subject, $email) = @ARGV;

# Munin passes the alert text on STDIN
my $content = '';
while (<STDIN>) {
    $content .= $_;
}

# For the systemd plugin, append the list of failed units
if ($subject =~ m/systemd_units/) {
    $content .= "\n" . qx{systemctl --failed};
}

open my $pipe, '|-', 'mail', '-s', $subject, '-r', $email, $email or die "Could not open pipe: $!";
print $pipe $content;
close $pipe or die "Could not close pipe: $?";
# munin.conf
contact.o3admins.command /home/tinita/munin-mail "${var:group} ${var:host} ${var:plugin} ${var:graph_category} '${var:graph_title}'" o3-admins@opensuse.org
tested with
/usr/lib/munin/munin-limits --always-send critical --config /home/tinita/munin-test.conf --stdout --contact o3admins
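To try the body-building part of the wrapper without sending mail or touching systemd, the same logic can be sketched in shell. This is only an illustration of the Perl script above, not a replacement for it. The function name build_alert_body and the LIST_FAILED_CMD override are hypothetical and exist only so the sketch can be exercised with a stub command.

```shell
#!/bin/sh
# Print the alert body: munin's message from stdin, plus extra detail when
# the subject names the systemd_units plugin.
# LIST_FAILED_CMD is a hypothetical override for testing; it defaults to the
# command used in the Perl script.
build_alert_body() {
    subject=$1
    # Pass munin's own message through unchanged.
    cat
    case $subject in
        *systemd_units*)
            printf '\n'
            # Unquoted on purpose so the command and its options are split.
            ${LIST_FAILED_CMD:-systemctl --failed}
            ;;
    esac
}
```

For example, printf 'test message\n' | LIST_FAILED_CMD='echo stub.service' build_alert_body 'systemd_units' prints the message followed by the stub output, while a subject for any other plugin passes the message through untouched.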
Updated by livdywan 5 months ago
/usr/lib/munin/munin-limits --always-send critical --config /home/tinita/munin-test.conf --stdout --contact o3admins
This gets me Can't open /var/log/munin/munin-limits.log (Permission denied) at /usr/lib/perl5/vendor_perl/5.26.1/Log/Log4perl/Appender/File.pm line 151.
😶
Updated by livdywan 5 months ago
tinita wrote in #note-29:
You need to run it as user munin
sudo -u munin /usr/lib/munin/munin-limits --always-send critical --stdout --contact o3admins
works. Except it sends me openqa_minion_jobs instead of systemd_status.
I checked via config as well as munin-update.log and once more I stop seeing emails after it seemed to work once...
Updated by tinita 5 months ago
livdywan wrote in #note-30:
tinita wrote in #note-29:
You need to run it as user munin
sudo -u munin /usr/lib/munin/munin-limits --always-send critical --stdout --contact o3admins
works. Except it sends me openqa_minion_jobs instead of systemd_status.
That's because openqa_minion_jobs is currently critical. So this is expected.
Updated by tinita 5 months ago
One reason I see is that according to
munin-run systemd_units
...
failed.value 0
there is no failed service
but
munin-run systemd_status
...
failed.value 1
Compare http://127.0.0.1:8080/munin/opensuse.org/openqa.opensuse.org/systemd_units.html and http://127.0.0.1:8080/munin/opensuse.org/openqa.opensuse.org/systemd_status.html
So first you have to make sure that the plugin actually is critical (or warning, then adjust the --always-send option).
Was that plugin ever reporting a failed service?
According to
rrdtool dump /var/lib/munin/opensuse.org/openqa.opensuse.org-systemd_units-failed-g.rrd | less
It always reported 0 or NaN.
Maybe check how the plugin is retrieving its values.
I will trigger a munin-limits run now to see if we get an email about the critical service now.
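The manual look through the RRD dump mentioned above could also be scripted. The sketch below assumes the usual rrdtool dump layout with one value per row in the form <row><v>...</v></row>, which is what a single-datasource munin RRD file should produce. The function name count_nonzero_rows is made up for illustration.

```shell
#!/bin/sh
# Read `rrdtool dump` output on stdin and count the rows whose value is a
# number greater than zero. NaN rows and zero rows are ignored.
count_nonzero_rows() {
    sed -n 's/.*<row><v>\([^<]*\)<\/v><\/row>.*/\1/p' |
        awk '$1 != "NaN" && ($1 + 0) > 0 { n++ } END { print n + 0 }'
}
```

It would be used as rrdtool dump /var/lib/munin/opensuse.org/openqa.opensuse.org-systemd_units-failed-g.rrd | count_nonzero_rows. A result of 0 would match the observation that the plugin never reported a failed service.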
Updated by tinita 5 months ago
I got this email, hopefully you too:
Date: Fri, 04 Oct 2024 15:52:47 +0000
From: o3-admins@opensuse.org
To: o3-admins@opensuse.org
Subject: opensuse.org openqa.opensuse.org openqa_minion_jobs minion 'Minion Jobs - see https://openqa.opensuse.org/minion/jobs?state=failed'
opensuse.org :: openqa.opensuse.org :: Minion Jobs - see https://openqa.opensuse.org/minion/jobs?state=failed
CRITICALs: failed is 655.00 (outside range [:500]).
So it's calling our script, and this works.
Updated by livdywan 5 months ago · Edited
Subject: opensuse.org openqa.opensuse.org openqa_minion_jobs minion 'Minion Jobs - see https://openqa.opensuse.org/minion/jobs?state=failed'
opensuse.org :: openqa.opensuse.org :: Minion Jobs - see https://openqa.opensuse.org/minion/jobs?state=failed
CRITICALs: failed is 655.00 (outside range [:500]).
So it's calling our script, and this works.
Sure. Except it's not picking up the failing systemd service:
sudo -u munin munin-run systemd_units config | grep ritical
failed.critical 1
Updated by tinita 5 months ago
livdywan wrote in #note-36:
Sure. Except it's not picking up the failing systemd service:
sudo -u munin munin-run systemd_units config | grep ritical
failed.critical 1
Well, that's the config output.
It tells munin which value of failed would make it critical.
The actual values you can get with
sudo munin-run systemd_units
...
failed.value 0
Updated by tinita 5 months ago
One problem seems to be that the service is displayed as loaded failed failed and the plugin only checks for the first one.
systemctl --no-pager --no-legend --all | grep run-r
● run-r3bd32618339d4da693f8aa9a4cb8cb48.service loaded failed failed /usr/bin/systemctl start abcdef.service
The command the plugin uses is
systemctl --no-pager --no-legend --all | awk '{print $1, $3}'
I'm not sure what the three columns mean exactly. The other plugin seems to detect it, so maybe we can use that one or both:
-if ($subject =~ m/systemd_units/) {
+if ($subject =~ m/systemd_(units|status)/) {
Looking at the munin emails from Sep 15 I also see that those came from "systemd services", and that's the title for the systemd_status plugin.
Subject: Munin - processes systemd services - opensuse.org :: openqa.opensuse.org
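The column selection in the awk command above can be checked against the sample line from this comment. One possible reading, not confirmed anywhere in this thread, is that the leading ● marker counts as a field of its own, so $3 lands on the LOAD column instead of ACTIVE for units that carry the marker. The sketch below only shows how awk splits such a line. The helper columns_without_bullet is a hypothetical variant for comparison, not what the plugin does.

```shell
#!/bin/sh
# Pick the fields the same way as the awk call quoted above ($1 and $3).
plugin_columns() {
    awk '{ print $1, $3 }'
}

# Hypothetical variant: skip a leading status bullet before picking the
# unit name and the ACTIVE column, so marked and unmarked lines agree.
columns_without_bullet() {
    awk '{ if ($1 == "●") { print $2, $4 } else { print $1, $3 } }'
}
```

Feeding the run-r3bd32618339d4da693f8aa9a4cb8cb48.service line through plugin_columns yields the bullet and "loaded", whereas columns_without_bullet yields the unit name and "failed".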
Updated by tinita 5 months ago
I made that change to the script in /usr/local/bin/munin-mail.
We probably would have to wait until the service is ok and then failed again to get an email.
Munin only sends emails when the status changes. Unless you configure it to always send, e.g.
contact.o3admins.always_send warning critical
I will change that now to see if we get an email and then comment it out again.
Updated by tinita 5 months ago
Ok, I edited munin.conf, restarted munin-node.service, and we got two emails.
Subject: opensuse.org openqa.opensuse.org systemd_status processes 'systemd services'
opensuse.org :: openqa.opensuse.org :: systemd services
WARNINGs: Services in failed state is 1.00 (outside range [0:0]).
UNIT LOAD ACTIVE SUB DESCRIPTION
● run-r3bd32618339d4da693f8aa9a4cb8cb48.service loaded failed failed /usr/bin/systemctl start abcdef.service
LOAD = Reflects whether the unit definition was properly loaded.
ACTIVE = The high-level unit activation state, i.e. generalization of SUB.
SUB = The low-level unit activation state, values depend on unit type.
1 loaded units listed.
That looks good, I'd say.
I commented it out again.
Updated by tinita 5 months ago
- Assignee changed from livdywan to tinita
I stopped and disabled the test service and ran systemctl reset-failed.
We now got
Subject: opensuse.org openqa.opensuse.org systemd_status processes 'systemd services'
opensuse.org :: openqa.opensuse.org :: systemd services
OKs: Services in failed state is 0.00.
UNIT LOAD ACTIVE SUB DESCRIPTION
0 loaded units listed.
as an OK email.
What's left?
I think we should add the script to git.
Updated by tinita 5 months ago
Draft: https://github.com/os-autoinst/openQA/pull/5979 Add munin alert email wrapper
Updated by tinita 5 months ago
- Status changed from Feedback to Resolved
https://github.com/os-autoinst/openQA/pull/5979 merged.
I changed /etc/munin/munin.conf to this:
contact.o3admins.command /usr/share/openqa/script/munin-mail "${var:group} ${var:host} ${var:plugin} ${var:graph_category} '${var:graph_title}'" o3-admins@opensuse.org
and deleted /usr/local/bin/munin-mail.
Now waiting until it happens again.