action #104350
closed
[alert] failed systemd service on grenache-1, os-autoinst-openvswitch, turned to "ok" automatically size:M
Description
Observation
From journalctl -b -u os-autoinst-openvswitch.service on grenache-1:
-- Logs begin at Mon 2021-12-20 21:50:34 CET, end at Sun 2021-12-26 14:26:34 CET. --
Dec 26 03:34:34 grenache-1 systemd[1]: Starting os-autoinst openvswitch helper...
Dec 26 03:35:34 grenache-1 wicked[3515]: device br1: unable to apply configuration to nanny
Dec 26 03:36:04 grenache-1 systemd[1]: os-autoinst-openvswitch.service: start-pre operation timed out. Terminating.
Dec 26 03:36:04 grenache-1 systemd[1]: os-autoinst-openvswitch.service: Control process exited, code=killed, status=15/TERM
Dec 26 03:36:04 grenache-1 systemd[1]: os-autoinst-openvswitch.service: Failed with result 'timeout'.
Dec 26 03:36:04 grenache-1 systemd[1]: Failed to start os-autoinst openvswitch helper.
Dec 26 04:38:08 grenache-1 systemd[1]: Starting os-autoinst openvswitch helper...
Dec 26 04:38:08 grenache-1 systemd[1]: Started os-autoinst openvswitch helper.
This triggered an alert (see https://stats.openqa-monitor.qa.suse.de/d/KToPYLEWz/failed-systemd-services?orgId=1&from=1640478716612&to=1640495459276 ) which resolved itself roughly one hour later, when the service started successfully.
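The durations can be read directly off the journal timestamps. A minimal sketch, assuming GNU date (for the -d option); the year 2021 is taken from the "Logs begin" line, and elapsed_seconds is a helper name introduced here for illustration:

```shell
#!/bin/sh
# Seconds elapsed between two timestamps (GNU date is assumed for -d).
elapsed_seconds() {
    start=$(date -u -d "$1" +%s)
    end=$(date -u -d "$2" +%s)
    echo $((end - start))
}

# "Starting ..." until "start-pre operation timed out"
elapsed_seconds '2021-12-26 03:34:34' '2021-12-26 03:36:04'   # prints 90

# Failure until the next successful start
elapsed_seconds '2021-12-26 03:36:04' '2021-12-26 04:38:08'   # prints 3724
```

So the start-pre step was terminated 90 s after the start attempt began, and the unit stayed failed for a little over an hour.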
Acceptance criteria
- AC1: At least the code flow in https://github.com/os-autoinst/os-autoinst/blob/master/os-autoinst-openvswitch and the corresponding systemd service has been reviewed once
Suggestions
- Investigate why os-autoinst-openvswitch.service times out after about 90 s (03:34:34 to 03:36:04 in the log above), when the config file in salt-states says 300s and https://github.com/os-autoinst/os-autoinst/blob/master/os-autoinst-openvswitch#L30 reads like the waiting time should be indefinite
- Check if one of the related systemd units has a retry. If not, add one or extend the timeout
On grenache-1.qa there is already:
# /etc/systemd/system/os-autoinst-openvswitch.service.d/override.conf
[Service]
ExecStartPre=/bin/sh -c 'command -v wicked >/dev/null && wicked ifstatus br1 | grep -q up || wicked ifup br1'
Not sure if this is manually maintained or where it comes from.
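As an illustration of the "add a retry or extend the timeout" suggestion, the sketch below writes a candidate drop-in to a scratch directory for review. The file name 40-start-timeout.conf and the values are assumptions made for this example, not something taken from salt-states; TimeoutStartSec=, Restart= and RestartSec= are standard [Service] options:

```shell
#!/bin/sh
# Write a hypothetical drop-in into the directory given as $1.
write_dropin() {
    mkdir -p "$1"
    cat > "$1/40-start-timeout.conf" <<'EOF'
[Service]
# Allow the start-up phase more time (illustrative value)
TimeoutStartSec=300
# Try again if start-up fails or times out (illustrative values)
Restart=on-failure
RestartSec=30
EOF
}

# Generate into a scratch directory first so the result can be reviewed.
scratch=$(mktemp -d)
write_dropin "$scratch"
cat "$scratch/40-start-timeout.conf"
```

On a worker the file would belong in /etc/systemd/system/os-autoinst-openvswitch.service.d/, followed by systemctl daemon-reload.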
Updated by okurz almost 3 years ago
- Subject changed from [alert] failed systemd service on grenache-1, os-autoinst-openvswitch, turned to "ok" automatically to [alert] failed systemd service on grenache-1, os-autoinst-openvswitch, turned to "ok" automatically size:M
- Description updated (diff)
- Priority changed from Normal to Low
Updated by mkittler almost 3 years ago
- Status changed from Workable to In Progress
I remember that strange "unable to apply configuration to nanny" error. So it isn't the first time we see this. However, I still cannot really make sense of it.
There's a systemd override which defines the operation that times out here:
martchus@grenache-1:~> systemctl cat os-autoinst-openvswitch.service
# /usr/lib/systemd/system/os-autoinst-openvswitch.service
…
# /etc/systemd/system/os-autoinst-openvswitch.service.d/30-init-timeout.conf
[Service]
Environment="OS_AUTOINST_OPENVSWITCH_INIT_TIMEOUT=300"
# /etc/systemd/system/os-autoinst-openvswitch.service.d/override.conf
[Service]
ExecStartPre=/bin/sh -c 'command -v wicked >/dev/null && wicked ifstatus br1 | grep -q up || wicked ifup br1'
The configured timeout is only used by os-autoinst-openvswitch itself but of course has no effect on the ExecStartPre command. I cannot find this override in our salt repos, so I assume it is a manual configuration or a leftover. The override is also present on other workers but not on all (e.g. it is present on many x86_64 workers and also on arm-2, but on none of the other arm workers).
To me this looks like an old workaround for ensuring the bridge is ready before starting the actual service. However, at this point the service handles the waiting itself. So I'm simply going to remove this override on OSD workers where it is still present.
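For reference, the control flow of the ExecStartPre one-liner quoted above can be traced with stand-ins. In the sketch below the wicked calls are replaced by echo so it runs anywhere; pre_start and its two arguments are hypothetical names used only for this illustration:

```shell
#!/bin/sh
# Same shape as the override: A && B | grep -q up || C
# $1: "yes" if wicked is available, $2: simulated bridge state
pre_start() {
    [ "$1" = yes ] && echo "br1 $2" | grep -q up || echo "would run: wicked ifup br1"
}

pre_start yes up     # prints nothing: bridge already up, ifup is skipped
pre_start yes down   # prints "would run: wicked ifup br1"
pre_start no up      # prints "would run: wicked ifup br1"
```

The last case shows that the fallback after || also runs when the availability check itself fails, since the shell evaluates an A && B || C list left to right.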
Updated by mkittler almost 3 years ago
- Status changed from In Progress to Resolved
Removed it on all OSD and o3 workers (on o3 only openqaworker7 had it). So now the timeout of 300 seconds should be used everywhere.