[Alerting] failed systemd service on worker11, os-autoinst-openvswitch. Failed at system boot, turned ok after some hours size:M
Received email "[Alerting] InfluxDB not reachable" at 2022-11-20 03:54. https://stats.openqa-monitor.qa.suse.de/d/KToPYLEWz/failed-systemd-services?from=1668898957390&to=1668947747978&viewPanel=6 shows the alert going to pending at 03:37, which is the time when we apply automatic reboots when necessary, e.g. for kernel and base library changes, and then back to ok at 06:38. That's 3h later. The system journal shows just:
Nov 20 03:37:13 worker11 50mounted-tests: debug: running subtest /usr/lib/os-probes/mounted/90linux-distro
Nov 20 03:37:13 worker11 sh: br1 setup-in-progress
Nov 20 03:37:13 worker11 systemd: os-autoinst-openvswitch.service: Control process exited, code=exited, status=162/n/a
Nov 20 03:37:13 worker11 systemd: os-autoinst-openvswitch.service: Failed with result 'exit-code'.
Nov 20 03:37:13 worker11 systemd: Failed to start os-autoinst openvswitch helper.
- AC1: os-autoinst-openvswitch is stable after repeated reboots on worker11
- In the past, problematic overrides of the systemd unit were present, so check for those
- Subject changed from [Alerting] failed systemd service on worker11, os-autoinst-openvswitch. Failed at system boot, turned ok after some hours to [Alerting] failed systemd service on worker11, os-autoinst-openvswitch. Failed at system boot, turned ok after some hours size:M
- Description updated (diff)
- Status changed from New to Workable
- Status changed from Workable to Resolved
- Assignee set to mkittler
I had forgotten about this ticket. So #123151#note-7 wasn't the first occurrence of the problem. It is even the same worker again.
I've checked again for overrides and there are actually two. The first is our normal bump of the timeout (deployed via salt) and the second seems to be a leftover (not in salt and not on other workers):
# /etc/systemd/system/os-autoinst-openvswitch.service.d/override.conf
[Service]
ExecStartPre=/bin/sh -c 'command -v wicked >/dev/null && wicked ifstatus br1 | grep -q up || wicked ifup br1'
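To illustrate how the `&&`/`|`/`||` chain in that ExecStartPre line evaluates, here is a minimal sketch that can be run without touching any interface. The `wicked` function is a hypothetical stub standing in for the real binary, and `IFSTATUS_OUTPUT`, `IFUP_RC` and `run_pre` are names made up for this sketch; the status strings are illustrative, not captured output.

```shell
# Hypothetical stub replacing the real wicked binary for this sketch.
# IFSTATUS_OUTPUT and IFUP_RC are made-up knobs, not wicked options.
wicked() {
    case "$1" in
        ifstatus) printf '%s\n' "$IFSTATUS_OUTPUT" ;;
        ifup) return "$IFUP_RC" ;;
    esac
}

# Same operator chain as in the override. The shell parses
# A && B | C || D as (A && (B | C)) || D, so "ifup" runs whenever
# wicked is missing or the ifstatus output does not contain "up",
# and its exit status then becomes the status of the whole command.
run_pre() {
    command -v wicked >/dev/null && wicked ifstatus br1 | grep -q up || wicked ifup br1
}

# Case 1: interface already reported as up, ifup is skipped.
IFSTATUS_OUTPUT="br1 up"
IFUP_RC=1
if run_pre; then echo "pre-start ok"; else echo "pre-start failed: $?"; fi

# Case 2: interface not up yet and ifup returns a non-zero status.
IFSTATUS_OUTPUT="br1 device-not-running"
IFUP_RC=162
if run_pre; then echo "pre-start ok"; else echo "pre-start failed: $?"; fi
```

The first case prints `pre-start ok`, the second `pre-start failed: 162`. If the status=162 in the journal above came from `wicked ifup br1` (an assumption, the journal does not say so explicitly), this would explain why the unit fails whenever br1 is still being set up at boot. Note also that `grep -q up` matches any line containing "up", including the substring in "setup".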
I have removed that file and reloaded the systemd daemons. That should fix the problem as it did in the past. I cleaned up such overrides a year ago (or maybe it was even longer?). Maybe this one wasn't covered because the machine wasn't online at the time. According to
sudo salt -C 'G@roles:worker' cmd.run 'test -e /etc/systemd/system/os-autoinst-openvswitch.service.d/override.conf'
the override is now gone on all salt-controlled workers. I've also checked the o3 workers and rebooted ow11.
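Since `test -e` prints nothing by itself, a variant with explicit output may be easier to read across many hosts. This is only a sketch: `check_override` is a hypothetical helper, and its optional root-directory argument exists purely so it can be tried against a scratch directory.

```shell
# Hypothetical helper: report whether the leftover drop-in exists.
# $1 is an optional root prefix (assumption, for trying it on a scratch tree).
# Returns 0 when the file is absent (the desired state), 1 when present.
check_override() {
    root="${1:-}"
    f="$root/etc/systemd/system/os-autoinst-openvswitch.service.d/override.conf"
    if test -e "$f"; then
        echo "override present: $f"
        return 1
    fi
    echo "override absent"
}

# Check the local machine; prints one line either way.
check_override || true
```

The body of the function could equally be passed as the command string to the `cmd.run` call above so that every worker answers with "present" or "absent".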