action #109443
closed
os-autoinst-openvswitch sometimes fails with `br1 setup-in-progress`
Added by mkittler over 2 years ago.
Updated over 2 years ago.
Category:
Regressions/Crashes
Description
Observation
The problem has been observed two times on qa-power8-5-kvm.qa.suse.de, e.g.:
martchus@QA-Power8-5-kvm:~> sudo journalctl --since '2 days ago' -fu os-autoinst-openvswitch
-- Logs begin at Fri 2022-03-18 08:31:24 CET. --
Apr 03 03:30:00 QA-Power8-5-kvm systemd[1]: Stopping os-autoinst openvswitch helper...
Apr 03 03:30:00 QA-Power8-5-kvm systemd[1]: os-autoinst-openvswitch.service: Succeeded.
Apr 03 03:30:00 QA-Power8-5-kvm systemd[1]: Stopped os-autoinst openvswitch helper.
-- Reboot --
Apr 03 03:36:15 QA-Power8-5-kvm systemd[1]: Starting os-autoinst openvswitch helper...
Apr 03 03:36:45 QA-Power8-5-kvm sh[11300]: br1 setup-in-progress
Apr 03 03:36:45 QA-Power8-5-kvm systemd[1]: os-autoinst-openvswitch.service: Control process exited, code=exited, status=162/n/a
Apr 03 03:36:45 QA-Power8-5-kvm systemd[1]: os-autoinst-openvswitch.service: Failed with result 'exit-code'.
Apr 03 03:36:45 QA-Power8-5-kvm systemd[1]: Failed to start os-autoinst openvswitch helper.
Apr 03 04:38:21 QA-Power8-5-kvm systemd[1]: Starting os-autoinst openvswitch helper...
Apr 03 04:38:22 QA-Power8-5-kvm systemd[1]: Started os-autoinst openvswitch helper.
Apr 03 07:23:26 QA-Power8-5-kvm systemd[1]: Stopping os-autoinst openvswitch helper...
Apr 03 07:23:26 QA-Power8-5-kvm systemd[1]: os-autoinst-openvswitch.service: Succeeded.
I've also attached the full journal from that time. Again, it looks like the script is simply called when the machine isn't ready and aborts before entering the loop in which it would wait with a timeout.
For the first occurrence, see #108668#note-4 (but that ticket unfortunately mixes two completely unrelated problems, only the part about os-autoinst-openvswitch is relevant).
- Related to action #108668: Failed systemd services alert (except openqa.suse.de) for < 60 min added
- Category set to Regressions/Crashes
- Target version set to Ready
I cannot reproduce the behavior when replacing `system('ovs-vsctl', 'br-exists', $self->{BRIDGE});` or `system('ovs-ofctl', 'add-flow', $self->{BRIDGE}, $rule);` with `system('false');`. I get at least `"false" unexpectedly returned exit value 1 at ./os-autoinst-openvswitch line 242.`. If I replace `ip …` with `false`, the loop is entered and I get `Waiting for IP on bridge 'br0', 60s left ...`.
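For readers unfamiliar with the pattern, the "wait with a timeout" loop mentioned above can be sketched roughly as follows. This is only an illustration in shell, not the actual Perl code from os-autoinst-openvswitch; the function name `wait_for` and its arguments are made up for this example.

```shell
#!/bin/sh
# wait_for TIMEOUT INTERVAL COMMAND [ARGS...]   (hypothetical helper)
# Re-runs COMMAND every INTERVAL seconds until it succeeds or TIMEOUT seconds
# have been spent waiting. Returns 0 on success, 1 on timeout.
wait_for() {
    timeout=$1
    interval=$2
    shift 2
    until "$@"; do
        if [ "$timeout" -le 0 ]; then
            echo "Timed out waiting for: $*" >&2
            return 1
        fi
        echo "Waiting for '$*', ${timeout}s left ..."
        sleep "$interval"
        timeout=$((timeout - interval))
    done
    return 0
}

# Example: a check that never succeeds gives up after about 2 seconds.
wait_for 2 1 false || echo "gave up"
```

A failing check inside such a loop only delays startup, whereas a failing command outside of it ends the script immediately, which matches the two outcomes described in the comment above.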
- Status changed from New to In Progress
It was likely a leftover override:
martchus@QA-Power8-5-kvm:~> systemctl cat os-autoinst-openvswitch.service
# /usr/lib/systemd/system/os-autoinst-openvswitch.service
# unit description file for os-autoinst openvswitch helper
# start using e.g.
# systemctl start os-autoinst-openvswitch.service
[Unit]
Description=os-autoinst openvswitch helper
BindsTo=openvswitch.service
After=openvswitch.service network.target
Before=openqa-worker.target
[Service]
Type=dbus
BusName=org.opensuse.os_autoinst.switch
Environment=OS_AUTOINST_USE_BRIDGE=br0
Environment=OS_AUTOINST_OPENVSWITCH_INIT_TIMEOUT=60
EnvironmentFile=-/etc/sysconfig/os-autoinst-openvswitch
ExecStart=/usr/lib/os-autoinst/os-autoinst-openvswitch
[Install]
WantedBy=multi-user.target
# /etc/systemd/system/os-autoinst-openvswitch.service.d/30-init-timeout.conf
[Service]
Environment="OS_AUTOINST_OPENVSWITCH_INIT_TIMEOUT=1200"
# /etc/systemd/system/os-autoinst-openvswitch.service.d/override.conf
[Service]
ExecStartPre=/bin/sh -c 'command -v wicked >/dev/null && wicked ifstatus br1 | grep -q up || wicked ifup br1'
I'm removing the override on all remaining workers and will reboot one of the workers a few times to test whether it works now.
Also see https://progress.opensuse.org/issues/59300#note-18 and https://gitlab.suse.de/openqa/salt-states-openqa/-/commit/7f79991ddf91fd2f8b981b8e356a44ef099a61fc.
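To check other workers for the same leftover, something like the following could be used. It is a minimal sketch: the helper `list_dropins` and its optional root-directory argument are invented for this example, and it only looks at the admin drop-in directory under /etc/systemd/system that appears in the `systemctl cat` output above.

```shell
#!/bin/sh
# list_dropins UNIT [ROOT]   (hypothetical helper)
# Prints the *.conf drop-in files for UNIT found below
# ROOT/etc/systemd/system/UNIT.d, one per line. ROOT defaults to "".
list_dropins() {
    unit=$1
    root=${2:-}
    dir="$root/etc/systemd/system/$unit.d"
    for f in "$dir"/*.conf; do
        # An unmatched glob stays literal, so only print real files.
        if [ -f "$f" ]; then
            echo "$f"
        fi
    done
    return 0
}

list_dropins os-autoinst-openvswitch.service

# Removing a leftover file by hand would then be, for example:
#   sudo rm /etc/systemd/system/os-autoinst-openvswitch.service.d/override.conf
#   sudo systemctl daemon-reload
```

Deleting only override.conf keeps 30-init-timeout.conf in place, so the longer init timeout stays in effect.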
- Due date set to 2022-04-18
We looked into this together. As the log shows, the exit code is "162", which is not a value our scripts set. The systemd journal also shows that the output comes from "sh". Both come from /etc/systemd/system/os-autoinst-openvswitch.service.d/override.conf, which is a leftover from https://gitlab.suse.de/openqa/salt-states-openqa/-/merge_requests/379. In #59300#note-18 I stated that I had worked on all workers that were online and salt-controlled at that time. It could very well be that the ppc workers were not included back then. mkittler has now fixed that manually.
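The ExecStartPre line in override.conf has the shape `A && B | C || D`, which the shell groups as `(A && (B | C)) || D`. When the left-hand side fails, `D` runs last and its exit status becomes the status of the whole `sh -c` command, which is how a value like 162 can reach systemd. A small sketch with stub functions shows this; `fake_ifstatus` and `fake_ifup` are hypothetical stand-ins, and their output and the status 162 are hard-coded to mirror the journal, not taken from wicked itself.

```shell
#!/bin/sh
# Stub for "wicked ifstatus br1": made-up output that does not contain "up".
fake_ifstatus() { echo "br1 not-ready"; }

# Stub for "wicked ifup br1": prints the journal line and returns 162.
fake_ifup() { echo "br1 setup-in-progress"; return 162; }

# Same shape as the ExecStartPre line: A && B | grep -q up || D
run_pre() {
    true && fake_ifstatus | grep -q up || fake_ifup
}

rc=0
run_pre || rc=$?
echo "exit status: $rc"
```

Because `grep -q` prints nothing, the only visible output is the line from the last command, attributed to `sh` in the journal.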
- Status changed from In Progress to Feedback
I've restarted pw-5 and it worked. The restart took very long and the worker was completely stuck in the middle (only a reset via the BMC web interface helped, not even an IPMI power cycle). So I don't dare to touch it again (at least for today).
- Status changed from Feedback to In Progress
- Status changed from In Progress to Resolved
And it went well. So I'm closing the issue.