Project

General

Profile

action #108668

Failed systemd services alert (except openqa.suse.de) for < 60 min

Added by cdywan 3 months ago. Updated 3 months ago.

Status:
Rejected
Priority:
High
Assignee:
Target version:
Start date:
2022-03-21
Due date:
% Done:

0%

Estimated time:

Description

This alert was triggered twice, once yesterday and once on Saturday. I was not able to find out what services may have failed.

[Alerting] Failed systemd services alert (except openqa.suse.de)

Check failed systemd services on hosts with `systemctl --failed`. Hint: Go to parent dashboard https://stats.openqa-monitor.qa.suse.de/d/KToPYLEWz/failed-systemd-services to see a list of affected hosts.
Metric name

Value
Sum of failed systemd services

1.000

Related issues

Related to openQA Tests - action #107062: Multiple failures due to network issuesFeedback2021-09-272022-04-15

Related to openQA Project - action #109443: os-autoinst-openvswitch sometimes fails with `br1 setup-in-progress`Resolved2022-04-042022-04-18

History

#1 Updated by okurz 3 months ago

  • Priority changed from Normal to High
  • Target version set to Ready

cdywan wrote:

This alert was triggered twice, once yesterday and once on Saturday. I was not able to find out what services may have failed.

Have you tried the link that is in the alert description?:

Check failed systemd services on hosts with systemctl --failed. Hint: Go to parent dashboard https://stats.openqa-monitor.qa.suse.de/d/KToPYLEWz/failed-systemd-services to see a list of affected hosts.

As the alerts had been from the previous days I needed to adjust the monitoring period a bit to see the affected services. https://stats.openqa-monitor.qa.suse.de/d/KToPYLEWz/failed-systemd-services?orgId=1&from=now-7d&to=now shows:

  • 2022-03-20 04:37:00 QA-Power8-5-kvm os-autoinst-openvswitch
  • 2022-03-20 00:10:40 openqa-monitor dehydrated

#2 Updated by cdywan 3 months ago

okurz wrote:

cdywan wrote:

This alert was triggered twice, once yesterday and once on Saturday. I was not able to find out what services may have failed.

Have you tried the link that is in the alert description?:

Yes, and in Grafana it was saying nothing failed. Which makes sense since the alert went back to OK.

#3 Updated by mkittler 3 months ago

  • Assignee set to mkittler

#4 Updated by mkittler 3 months ago

The logs of dehydrated don't go far enough:

martchus@openqa-monitor:~> sudo journalctl --since '7 days ago' -fu dehydrated
-- Logs begin at Sun 2022-03-20 17:08:04 CET. --
Mär 21 01:27:10 openqa-monitor systemd[1]: Starting Certificate Update Runner for Dehydrated...
Mär 21 01:27:10 openqa-monitor dehydrated[13512]: # INFO: Using main config file /etc/dehydrated/config
Mär 21 01:27:10 openqa-monitor dehydrated[13512]: # INFO: Using additional config file /etc/dehydrated/config.d/suse-ca.sh
Mär 21 01:27:10 openqa-monitor dehydrated[13512]: # INFO: Running /usr/bin/dehydrated as dehydrated/dehydrated
Mär 21 01:27:10 openqa-monitor sudo[13512]:     root : PWD=/ ; USER=dehydrated ; GROUP=dehydrated ; COMMAND=/usr/bin/dehydrated --cron
Mär 21 01:27:11 openqa-monitor systemd[1]: dehydrated.service: Succeeded.
…
Mär 22 00:32:00 openqa-monitor systemd[1]: Finished Certificate Update Runner for Dehydrated.
Mär 23 01:49:10 openqa-monitor systemd[1]: Starting Certificate Update Runner for Dehydrated...
Mär 23 01:49:10 openqa-monitor dehydrated[30242]: # INFO: Using main config file /etc/dehydrated/config
Mär 23 01:49:10 openqa-monitor dehydrated[30242]: # INFO: Using additional config file /etc/dehydrated/config.d/suse-ca.sh
Mär 23 01:49:10 openqa-monitor dehydrated[30242]: # INFO: Running /usr/bin/dehydrated as dehydrated/dehydrated
Mär 23 01:49:10 openqa-monitor sudo[30242]:     root : PWD=/ ; USER=dehydrated ; GROUP=dehydrated ; COMMAND=/usr/bin/dehydrated --cron
Mär 23 01:49:10 openqa-monitor systemd[1]: dehydrated.service: Succeeded.
Mär 23 01:49:10 openqa-monitor systemd[1]: Finished Certificate Update Runner for Dehydrated.

I wouldn't try to investigate the issue further without logs.


The os-autoinst-openvswitch logs do but it there's not much to go on nevertheless:

martchus@QA-Power8-5-kvm:~> sudo journalctl --since '5 days ago' -fu os-autoinst-openvswitch
-- Logs begin at Sat 2022-02-26 12:14:55 CET. --
Mär 18 17:21:56 QA-Power8-5-kvm ovs-vsctl[72162]: ovs|00001|vsctl|INFO|Called as ovs-vsctl set port tap6 tag=4 vlan_mode=dot1q-tunnel
…
Mär 20 03:30:00 QA-Power8-5-kvm systemd[1]: Stopped os-autoinst openvswitch helper.
-- Reboot --
Mär 20 03:35:31 QA-Power8-5-kvm systemd[1]: Starting os-autoinst openvswitch helper...
Mär 20 03:36:01 QA-Power8-5-kvm sh[3732]: br1             setup-in-progress
Mär 20 03:36:01 QA-Power8-5-kvm systemd[1]: os-autoinst-openvswitch.service: Control process exited, code=exited, status=162/n/a
Mär 20 03:36:01 QA-Power8-5-kvm systemd[1]: os-autoinst-openvswitch.service: Failed with result 'exit-code'.
Mär 20 03:36:01 QA-Power8-5-kvm systemd[1]: Failed to start os-autoinst openvswitch helper.
Mär 20 04:37:28 QA-Power8-5-kvm systemd[1]: Starting os-autoinst openvswitch helper...
Mär 20 04:37:28 QA-Power8-5-kvm systemd[1]: Started os-autoinst openvswitch helper.
Mär 20 06:50:47 QA-Power8-5-kvm systemd[1]: Stopping os-autoinst openvswitch helper...
Mär 20 06:50:47 QA-Power8-5-kvm systemd[1]: os-autoinst-openvswitch.service: Succeeded.
Mär 20 06:50:47 QA-Power8-5-kvm systemd[1]: Stopped os-autoinst openvswitch helper.
Mär 20 06:50:47 QA-Power8-5-kvm systemd[1]: Starting os-autoinst openvswitch helper...
Mär 20 06:50:47 QA-Power8-5-kvm systemd[1]: Started os-autoinst openvswitch helper.
Mär 21 12:00:58 QA-Power8-5-kvm ovs-vsctl[97218]: ovs|00001|vsctl|INFO|Called as ovs-vsctl set port tap2 tag=26 vlan_mode=dot1q-tunnel

The system was still booting and networking wasn't fully available at the time:

Mär 20 03:35:55 QA-Power8-5-kvm worker[3789]: [warn] [pid:3789] Failed to register at openqa.suse.de - connection error: Can't connect: Name or service not known - trying again in 10 seconds
Mär 20 03:35:55 QA-Power8-5-kvm worker[3784]: [warn] [pid:3784] Failed to register at openqa.suse.de - connection error: Can't connect: Name or service not known - trying again in 10 seconds
Mär 20 03:35:55 QA-Power8-5-kvm worker[3785]: [warn] [pid:3785] Failed to register at openqa.suse.de - connection error: Can't connect: Name or service not known - trying again in 10 seconds
Mär 20 03:35:55 QA-Power8-5-kvm worker[3790]: [warn] [pid:3790] Failed to register at openqa.suse.de - connection error: Can't connect: Name or service not known - trying again in 10 seconds
Mär 20 03:36:00 QA-Power8-5-kvm telegraf[3742]: 2022-03-20T02:36:00Z E! [inputs.ping] Error in plugin: lookup openqa.suse.de on [::1]:53: read udp [::1]:38609->[::1]:53: read: connection refused
Mär 20 03:36:00 QA-Power8-5-kvm ovs-vsctl[5533]: ovs|00001|vsctl|INFO|Called as /usr/bin/ovs-vsctl --may-exist add-port br1 tap0
Mär 20 03:36:00 QA-Power8-5-kvm ovs-vsctl[5535]: ovs|00001|vsctl|INFO|Called as /usr/bin/ovs-vsctl --may-exist add-port br1 tap64
Mär 20 03:36:00 QA-Power8-5-kvm ovs-vsctl[5537]: ovs|00001|vsctl|INFO|Called as /usr/bin/ovs-vsctl --may-exist add-port br1 tap128
Mär 20 03:36:00 QA-Power8-5-kvm ovs-vsctl[5539]: ovs|00001|vsctl|INFO|Called as /usr/bin/ovs-vsctl --may-exist add-port br1 tap1
Mär 20 03:36:00 QA-Power8-5-kvm ovs-vsctl[5541]: ovs|00001|vsctl|INFO|Called as /usr/bin/ovs-vsctl --may-exist add-port br1 tap65
Mär 20 03:36:00 QA-Power8-5-kvm ovs-vsctl[5543]: ovs|00001|vsctl|INFO|Called as /usr/bin/ovs-vsctl --may-exist add-port br1 tap129
Mär 20 03:36:00 QA-Power8-5-kvm ovs-vsctl[5545]: ovs|00001|vsctl|INFO|Called as /usr/bin/ovs-vsctl --may-exist add-port br1 tap2
Mär 20 03:36:01 QA-Power8-5-kvm ovs-vsctl[5547]: ovs|00001|vsctl|INFO|Called as /usr/bin/ovs-vsctl --may-exist add-port br1 tap66
Mär 20 03:36:01 QA-Power8-5-kvm ovs-vsctl[5549]: ovs|00001|vsctl|INFO|Called as /usr/bin/ovs-vsctl --may-exist add-port br1 tap130
Mär 20 03:36:01 QA-Power8-5-kvm ovs-vsctl[5551]: ovs|00001|vsctl|INFO|Called as /usr/bin/ovs-vsctl --may-exist add-port br1 tap3
Mär 20 03:36:01 QA-Power8-5-kvm ovs-vsctl[5553]: ovs|00001|vsctl|INFO|Called as /usr/bin/ovs-vsctl --may-exist add-port br1 tap67
Mär 20 03:36:01 QA-Power8-5-kvm ovs-vsctl[5555]: ovs|00001|vsctl|INFO|Called as /usr/bin/ovs-vsctl --may-exist add-port br1 tap131
Mär 20 03:36:01 QA-Power8-5-kvm ovs-vsctl[5557]: ovs|00001|vsctl|INFO|Called as /usr/bin/ovs-vsctl --may-exist add-port br1 tap4
Mär 20 03:36:01 QA-Power8-5-kvm ovs-vsctl[5559]: ovs|00001|vsctl|INFO|Called as /usr/bin/ovs-vsctl --may-exist add-port br1 tap68
Mär 20 03:36:01 QA-Power8-5-kvm ovs-vsctl[5561]: ovs|00001|vsctl|INFO|Called as /usr/bin/ovs-vsctl --may-exist add-port br1 tap132
Mär 20 03:36:01 QA-Power8-5-kvm ovs-vsctl[5563]: ovs|00001|vsctl|INFO|Called as /usr/bin/ovs-vsctl --may-exist add-port br1 tap5
Mär 20 03:36:01 QA-Power8-5-kvm ovs-vsctl[5565]: ovs|00001|vsctl|INFO|Called as /usr/bin/ovs-vsctl --may-exist add-port br1 tap69
Mär 20 03:36:01 QA-Power8-5-kvm ovs-vsctl[5567]: ovs|00001|vsctl|INFO|Called as /usr/bin/ovs-vsctl --may-exist add-port br1 tap133
Mär 20 03:36:01 QA-Power8-5-kvm ovs-vsctl[5569]: ovs|00001|vsctl|INFO|Called as /usr/bin/ovs-vsctl --may-exist add-port br1 tap6
Mär 20 03:36:01 QA-Power8-5-kvm ovs-vsctl[5571]: ovs|00001|vsctl|INFO|Called as /usr/bin/ovs-vsctl --may-exist add-port br1 tap70
Mär 20 03:36:01 QA-Power8-5-kvm ovs-vsctl[5573]: ovs|00001|vsctl|INFO|Called as /usr/bin/ovs-vsctl --may-exist add-port br1 tap134
Mär 20 03:36:01 QA-Power8-5-kvm ovs-vsctl[5575]: ovs|00001|vsctl|INFO|Called as /usr/bin/ovs-vsctl --may-exist add-port br1 tap7
Mär 20 03:36:01 QA-Power8-5-kvm ovs-vsctl[5577]: ovs|00001|vsctl|INFO|Called as /usr/bin/ovs-vsctl --may-exist add-port br1 tap71
Mär 20 03:36:01 QA-Power8-5-kvm ovs-vsctl[5579]: ovs|00001|vsctl|INFO|Called as /usr/bin/ovs-vsctl --may-exist add-port br1 tap135
Mär 20 03:36:01 QA-Power8-5-kvm wickedd-dhcp4[3024]: eth3: Request to acquire DHCPv4 lease with UUID f7923662-fc3a-0b00-7f0c-000020000000
Mär 20 03:36:01 QA-Power8-5-kvm wickedd-dhcp6[3025]: eth3: Request to acquire DHCPv6 lease with UUID f7923662-fc3a-0b00-7f0c-000021000000 in mode auto
Mär 20 03:36:01 QA-Power8-5-kvm kernel: NET: Registered protocol family 17
Mär 20 03:36:01 QA-Power8-5-kvm sh[3732]: br1             setup-in-progress
Mär 20 03:36:01 QA-Power8-5-kvm systemd[1]: os-autoinst-openvswitch.service: Control process exited, code=exited, status=162/n/a
Mär 20 03:36:01 QA-Power8-5-kvm systemd[1]: os-autoinst-openvswitch.service: Failed with result 'exit-code'.
Mär 20 03:36:01 QA-Power8-5-kvm systemd[1]: Failed to start os-autoinst openvswitch helper.
Mär 20 03:36:01 QA-Power8-5-kvm systemd[1]: Reached target Multi-User System.
Mär 20 03:36:01 QA-Power8-5-kvm systemd[1]: Starting Update UTMP about System Runlevel Changes...
Mär 20 03:36:01 QA-Power8-5-kvm systemd[1]: systemd-update-utmp-runlevel.service: Succeeded.
Mär 20 03:36:01 QA-Power8-5-kvm systemd[1]: Finished Update UTMP about System Runlevel Changes.
Mär 20 03:36:01 QA-Power8-5-kvm systemd[1]: Startup finished in 3.392s (kernel) + 9.126s (initrd) + 1min 12.873s (userspace) = 1min 25.392s.
Mär 20 03:36:02 QA-Power8-5-kvm wickedd-dhcp4[3024]: eth3: Committed DHCPv4 lease with address 10.162.6.202 (lease time 86400 sec, renew in 43200 sec, rebind in 75600 sec)
Mär 20 03:36:05 QA-Power8-5-kvm worker[3783]: [info] [pid:3783] Registering with openQA openqa.suse.de
Mär 20 03:36:05 QA-Power8-5-kvm worker[3788]: [info] [pid:3788] Registering with openQA openqa.suse.de
Mär 20 03:36:05 QA-Power8-5-kvm worker[3787]: [info] [pid:3787] Registering with openQA openqa.suse.de
Mär 20 03:36:05 QA-Power8-5-kvm worker[3786]: [info] [pid:3786] Registering with openQA openqa.suse.de
Mär 20 03:36:05 QA-Power8-5-kvm worker[3789]: [info] [pid:3789] Registering with openQA openqa.suse.de
Mär 20 03:36:05 QA-Power8-5-kvm worker[3785]: [info] [pid:3785] Registering with openQA openqa.suse.de
Mär 20 03:36:05 QA-Power8-5-kvm worker[3790]: [info] [pid:3790] Registering with openQA openqa.suse.de
Mär 20 03:36:05 QA-Power8-5-kvm worker[3784]: [info] [pid:3784] Registering with openQA openqa.suse.de
Mär 20 03:36:56 QA-Power8-5-kvm chronyd[4153]: Received KoD RATE from 129.70.132.37
Mär 20 03:37:01 QA-Power8-5-kvm chronyd[4153]: Selected source 10.160.0.44 (ntp1.suse.de)
Mär 20 03:38:19 QA-Power8-5-kvm worker[3783]: [warn] [pid:3783] Failed to register at openqa.suse.de - connection error: Connection timed out - trying again in 10 seconds
Mär 20 03:38:19 QA-Power8-5-kvm worker[3786]: [warn] [pid:3786] Failed to register at openqa.suse.de - connection error: Connection timed out - trying again in 10 seconds

We actually have a timeout of 20 minutes but the service already failed after 30 seconds. So it likely already existed before entering the loop where it waits for an IP on the bridge which would also explain that there's almost no output in the logs. Likely it failed on the ovs-vsctl br-exists call or on the ip addr show call.

#5 Updated by mkittler 3 months ago

  • Status changed from New to Feedback

I've been experimenting with os-autoinst-openvswitch locally to understand what kind of error condition would lead to the behavior we've seen but couldn't reproduce the problem. So I'm not sure what to do here. Bumping the timeout likely won't help as the script exited not due to the timeout.

Since I haven't seen this issue before I'm assuming it is the first occurrence. Therefore I'd wait for a second occurrence before putting any further effort into it.

#6 Updated by mkittler 3 months ago

The logs show that os-autoinst-openvswitch.service is started before all networking "stuff" is ready:

Mär 20 03:35:31 QA-Power8-5-kvm systemd[1]: Starting os-autoinst openvswitch helper
Mär 20 03:36:01 QA-Power8-5-kvm kernel: NET: Registered protocol family 17

I wouldn't expect the kernel: NET: Registered protocol family 17 message to show up only 30 seconds after os-autoinst-openvswitch.service is started. Note that os-autoinst-openvswitch.service is setup to start after network.target. Apparently protocol family 17 isn't considered by network.target.

I found it a bit weird that the raw number family number (17) shows up in the message, e.g. locally I get:

[martchus@linux-9lzf ~]$ dmesg | grep -i family
[    0.097159] smpboot: CPU0: Intel(R) Core(TM) i7-6820HQ CPU @ 2.70GHz (family: 0x6, model: 0x5e, stepping: 0x3)
[    0.144946] NET: Registered PF_NETLINK/PF_ROUTE protocol family
[    5.361235] NET: Registered PF_MCTP protocol family
[    5.394444] NET: Registered PF_INET protocol family
[    5.397106] NET: Registered PF_UNIX/PF_LOCAL protocol family
[    5.397110] NET: Registered PF_XDP protocol family
[    5.652593] NET: Registered PF_INET6 protocol family
[   12.139583] NET: Registered PF_BLUETOOTH protocol family
[   12.621606] NET: Registered PF_PACKET protocol family
[   13.801671] NET: Registered PF_ALG protocol family

Apparently that's normal on that host (and maybe depends on the kernel version):

martchus@QA-Power8-5-kvm:~> sudo dmesg | grep -i family
[    1.110544] NET: Registered protocol family 16
[    2.006121] NET: Registered protocol family 2
[    2.050785] NET: Registered protocol family 1
[    2.050800] NET: Registered protocol family 44
[    3.866515] NET: Registered protocol family 10
[    3.872371] NET: Registered protocol family 15
[    5.765044] lpfc 0002:01:00.0: 1: [    5.555892] 0:2534 Device Info: SLIFamily=0xb, SLIRev=0x4, IFType=0x2, SLIHint_1=0x2, SLIHint_2=0x0, FT=0x0
[    7.424871] lpfc 0002:01:00.1: 1: [    7.298537] 1:2534 Device Info: SLIFamily=0xb, SLIRev=0x4, IFType=0x2, SLIHint_1=0x2, SLIHint_2=0x0, FT=0x0
[   85.859731] NET: Registered protocol family 17

So apparently family 17 took way longer than all the other protocol families to register. Either this is lazy initialized or something network related took quite long. By the way, 17 is PF_PACKET (see /usr/include/bits/socket.h and man packet).

When trying to invoke the dmesg command I've noticed serious connection issues in my SSH session (while VPN itself or my internet connection were definitely not the issue because other SSH connections to internal hosts were not affected at all). Maybe these sporadic network issues were also the cause of os-autoinst-openvswitch.service failing. Note that the connection never completely goes down. However, after a certain amount of typing it just gets stuck and then it takes several seconds to recover. Maybe this is related to #108845 after all?

#7 Updated by mkittler 3 months ago

I've just checked the SSH connection again (before closing it) and now I cannot reproduce the networking problems anymore. So it is definitely only happening sometimes. In the meantime I also checked some jobs and found e.g. https://openqa.suse.de/tests/8385120. It looks stuck on the installation, maybe also due to networking problems? (In the previous job it looks more like an actual bug, though: https://openqa.suse.de/tests/8374870#step/await_install/2)

#8 Updated by okurz 3 months ago

  • Related to action #107062: Multiple failures due to network issues added

#9 Updated by okurz 3 months ago

If in doubt you can try to do a reboot-test to reproduce the problem, see the last point in https://progress.opensuse.org/projects/openqav3/wiki/#Best-practices-for-infrastructure-work

#10 Updated by mkittler 3 months ago

Maybe both problems were related to networking problems (#108845). I'll wait the next reboot but if it doesn't happen again I'll mark the issue as resolved.

#11 Updated by okurz 3 months ago

  • Status changed from Feedback to Rejected

Discussed it during the weekly estimation meeting. We have not seen the problem again. Other network problems have been fixed meanwhile. We have reviewed the code of os-autoinst-openvswitch. We failed to reproduce or simulate the issue. Given that we would see our alerts flagging the issue again in case we ever see it we can followup.

#12 Updated by okurz 3 months ago

  • Related to action #109443: os-autoinst-openvswitch sometimes fails with `br1 setup-in-progress` added

Also available in: Atom PDF