action #109443
closed
os-autoinst-openvswitch sometimes fails with `br1 setup-in-progress`
Description
Observation
The problem has been observed two times on qa-power8-5-kvm.qa.suse.de, e.g.:
martchus@QA-Power8-5-kvm:~> sudo journalctl --since '2 days ago' -fu os-autoinst-openvswitch
-- Logs begin at Fri 2022-03-18 08:31:24 CET. --
Apr 03 03:30:00 QA-Power8-5-kvm systemd[1]: Stopping os-autoinst openvswitch helper...
Apr 03 03:30:00 QA-Power8-5-kvm systemd[1]: os-autoinst-openvswitch.service: Succeeded.
Apr 03 03:30:00 QA-Power8-5-kvm systemd[1]: Stopped os-autoinst openvswitch helper.
-- Reboot --
Apr 03 03:36:15 QA-Power8-5-kvm systemd[1]: Starting os-autoinst openvswitch helper...
Apr 03 03:36:45 QA-Power8-5-kvm sh[11300]: br1 setup-in-progress
Apr 03 03:36:45 QA-Power8-5-kvm systemd[1]: os-autoinst-openvswitch.service: Control process exited, code=exited, status=162/n/a
Apr 03 03:36:45 QA-Power8-5-kvm systemd[1]: os-autoinst-openvswitch.service: Failed with result 'exit-code'.
Apr 03 03:36:45 QA-Power8-5-kvm systemd[1]: Failed to start os-autoinst openvswitch helper.
Apr 03 04:38:21 QA-Power8-5-kvm systemd[1]: Starting os-autoinst openvswitch helper...
Apr 03 04:38:22 QA-Power8-5-kvm systemd[1]: Started os-autoinst openvswitch helper.
Apr 03 07:23:26 QA-Power8-5-kvm systemd[1]: Stopping os-autoinst openvswitch helper...
Apr 03 07:23:26 QA-Power8-5-kvm systemd[1]: os-autoinst-openvswitch.service: Succeeded.
I've also attached the full journal from that time. It again looks as if the script is simply called when the machine isn't ready and aborts before entering the loop in which it would wait with a timeout.
For the first occurrence, see #108668#note-4 (that ticket unfortunately mixes two completely unrelated problems; only the part about os-autoinst-openvswitch is relevant).
Updated by okurz over 2 years ago
- Related to action #108668: Failed systemd services alert (except openqa.suse.de) for < 60 min added
Updated by okurz over 2 years ago
- Category set to Regressions/Crashes
- Target version set to Ready
Updated by mkittler over 2 years ago
I cannot reproduce the behavior when replacing `system('ovs-vsctl', 'br-exists', $self->{BRIDGE});` or `system('ovs-ofctl', 'add-flow', $self->{BRIDGE}, $rule);` with `system('false');`. I get at least `"false" unexpectedly returned exit value 1 at ./os-autoinst-openvswitch line 242.`. If I replace `ip …` with `false`, the loop is entered and I get `Waiting for IP on bridge 'br0', 60s left ...`.
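For reference, the wait-with-timeout behavior I expect from the loop can be sketched roughly like this. This is only an illustration in shell, not the actual Perl code of os-autoinst-openvswitch; `wait_for_ip` and its arguments are made-up names, and the check command is passed in so it can be swapped for `false` as in the experiment above.

```shell
#!/bin/sh
# Sketch only: wait_for_ip is a hypothetical helper, not taken from os-autoinst-openvswitch.
# Usage: wait_for_ip BRIDGE TIMEOUT_SECONDS CHECK_COMMAND [ARGS...]
wait_for_ip() {
    bridge=$1
    timeout=$2
    shift 2
    # Retry the check once per second until it succeeds or the timeout is used up.
    while ! "$@"; do
        if [ "$timeout" -le 0 ]; then
            echo "Timed out waiting for IP on bridge '$bridge'" >&2
            return 1
        fi
        echo "Waiting for IP on bridge '$bridge', ${timeout}s left ..."
        sleep 1
        timeout=$((timeout - 1))
    done
}
```

Calling `wait_for_ip br0 60 false` keeps printing the waiting message and only gives up after the timeout, which is different from the immediate abort seen in the journal.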
Updated by mkittler over 2 years ago
- Status changed from New to In Progress
It was likely a leftover override:
martchus@QA-Power8-5-kvm:~> systemctl cat os-autoinst-openvswitch.service
# /usr/lib/systemd/system/os-autoinst-openvswitch.service
# unit description file for os-autoinst openvswitch helper
# start using e.g.
# systemctl start os-autoinst-openvswitch.service
[Unit]
Description=os-autoinst openvswitch helper
BindsTo=openvswitch.service
After=openvswitch.service network.target
Before=openqa-worker.target
[Service]
Type=dbus
BusName=org.opensuse.os_autoinst.switch
Environment=OS_AUTOINST_USE_BRIDGE=br0
Environment=OS_AUTOINST_OPENVSWITCH_INIT_TIMEOUT=60
EnvironmentFile=-/etc/sysconfig/os-autoinst-openvswitch
ExecStart=/usr/lib/os-autoinst/os-autoinst-openvswitch
[Install]
WantedBy=multi-user.target
# /etc/systemd/system/os-autoinst-openvswitch.service.d/30-init-timeout.conf
[Service]
Environment="OS_AUTOINST_OPENVSWITCH_INIT_TIMEOUT=1200"
# /etc/systemd/system/os-autoinst-openvswitch.service.d/override.conf
[Service]
ExecStartPre=/bin/sh -c 'command -v wicked >/dev/null && wicked ifstatus br1 | grep -q up || wicked ifup br1'
I've removed the override on all remaining workers and am rebooting one of the workers a few times to test whether it works now.
Also see https://progress.opensuse.org/issues/59300#note-18 and https://gitlab.suse.de/openqa/salt-states-openqa/-/commit/7f79991ddf91fd2f8b981b8e356a44ef099a61fc.
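The manual cleanup could be sketched as follows. This is a minimal illustration, assuming only the `override.conf` drop-in should go while `30-init-timeout.conf` stays; `remove_override` is a hypothetical helper name and not part of salt-states-openqa.

```shell
#!/bin/sh
# Sketch only: remove_override is a hypothetical helper.
# It deletes override.conf from a unit's drop-in directory and leaves other drop-ins alone.
remove_override() {
    dropin_dir=$1
    if [ -f "$dropin_dir/override.conf" ]; then
        rm -- "$dropin_dir/override.conf"
        echo "removed $dropin_dir/override.conf"
    else
        echo "no override in $dropin_dir"
    fi
}

# On a worker this would be followed by a reload so systemd picks up the change:
#   remove_override /etc/systemd/system/os-autoinst-openvswitch.service.d
#   systemctl daemon-reload
```

Afterwards `systemctl cat os-autoinst-openvswitch.service` should no longer list the `ExecStartPre` line.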
Updated by okurz over 2 years ago
- Due date set to 2022-04-18
We looked into this together. As we saw in the log, the exit code is "162", which is nothing that our scripts set. The journal also shows that the output comes from "sh". Both come from /etc/systemd/system/os-autoinst-openvswitch.service.d/override.conf, which is a leftover from https://gitlab.suse.de/openqa/salt-states-openqa/-/merge_requests/379. In #59300#note-18 I stated that I had worked on all workers that were online and salt-controlled at that time. It could very well be that the ppc workers were not included back then. mkittler has now fixed that manually.
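To illustrate why the exit code and the "sh" output point at the override, here is a small sketch of how a chain shaped like the `ExecStartPre` line behaves. The `fake_*` functions are hypothetical stand-ins for the wicked calls; their output and the status 162 are chosen to mirror the journal and are not taken from wicked itself.

```shell
#!/bin/sh
# Sketch only: hypothetical stand-ins for "wicked ifstatus br1" and "wicked ifup br1".
fake_ifstatus() { echo "br1 not-ready"; }
fake_ifup() { echo "br1 setup-in-progress"; return 162; }

# Same shape as the ExecStartPre line: A && B | grep -q up || C
# The shell groups this as (A && (B | grep -q up)) || C, so C runs whenever
# the status check does not report "up", and C's exit status becomes the result.
pre_check() {
    command -v sh >/dev/null && fake_ifstatus | grep -q up || fake_ifup
}

status=0
pre_check || status=$?
echo "exit status: $status"
```

In this sketch the status check fails, the last command prints its message and returns 162, and that value is what the shell hands back. With an `ExecStartPre` command failing like that, systemd marks the unit as failed before the main script is started, which would match the journal above.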
Updated by mkittler over 2 years ago
- Status changed from In Progress to Feedback
I restarted pw-5 and it worked. The restart took very long and the worker got completely stuck in the middle (only a reset via the BMC web interface helped, not even `ipmi power cycle`). So I don't dare to touch it again (at least for today).
Updated by okurz over 2 years ago
Don't forget what https://progress.opensuse.org/projects/qa/wiki/Tools#Best-practices says: "if it hurts, do it more often": https://www.martinfowler.com/bliki/FrequencyReducesDifficulty.html ;)
Updated by mkittler over 2 years ago
- Status changed from Feedback to In Progress
Ok, one more reboot :-)
Updated by mkittler over 2 years ago
- Status changed from In Progress to Resolved
And it went well. So I'm closing the issue.