action #109443

closed

os-autoinst-openvswitch sometimes fails with `br1 setup-in-progress`

Added by mkittler over 2 years ago. Updated over 2 years ago.

Status: Resolved
Priority: Normal
Assignee:
Category: Regressions/Crashes
Target version:
Start date: 2022-04-04
Due date: 2022-04-18
% Done: 0%
Estimated time:

Description

Observation

The problem has been observed two times on qa-power8-5-kvm.qa.suse.de, e.g.:

martchus@QA-Power8-5-kvm:~> sudo journalctl --since '2 days ago' -fu os-autoinst-openvswitch
-- Logs begin at Fri 2022-03-18 08:31:24 CET. --
Apr 03 03:30:00 QA-Power8-5-kvm systemd[1]: Stopping os-autoinst openvswitch helper...
Apr 03 03:30:00 QA-Power8-5-kvm systemd[1]: os-autoinst-openvswitch.service: Succeeded.
Apr 03 03:30:00 QA-Power8-5-kvm systemd[1]: Stopped os-autoinst openvswitch helper.
-- Reboot --
Apr 03 03:36:15 QA-Power8-5-kvm systemd[1]: Starting os-autoinst openvswitch helper...
Apr 03 03:36:45 QA-Power8-5-kvm sh[11300]: br1             setup-in-progress
Apr 03 03:36:45 QA-Power8-5-kvm systemd[1]: os-autoinst-openvswitch.service: Control process exited, code=exited, status=162/n/a
Apr 03 03:36:45 QA-Power8-5-kvm systemd[1]: os-autoinst-openvswitch.service: Failed with result 'exit-code'.
Apr 03 03:36:45 QA-Power8-5-kvm systemd[1]: Failed to start os-autoinst openvswitch helper.
Apr 03 04:38:21 QA-Power8-5-kvm systemd[1]: Starting os-autoinst openvswitch helper...
Apr 03 04:38:22 QA-Power8-5-kvm systemd[1]: Started os-autoinst openvswitch helper.
Apr 03 07:23:26 QA-Power8-5-kvm systemd[1]: Stopping os-autoinst openvswitch helper...
Apr 03 07:23:26 QA-Power8-5-kvm systemd[1]: os-autoinst-openvswitch.service: Succeeded.

I've also attached the full journal from that time. Again, it looks as if the script is simply called before the machine is ready and aborts before entering the loop in which it would wait with a timeout.
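
For illustration, a wait loop with a timeout of the kind mentioned above could look roughly like this. It is a minimal shell sketch and not the code of the actual script; the function name `wait_until_ready` and the readiness check passed to it are hypothetical.

```shell
#!/bin/sh
# Hypothetical helper: retry a readiness check once per second until it
# succeeds or the given number of seconds is used up.
wait_until_ready() {
    remaining=$1
    shift
    while ! "$@"; do
        if [ "$remaining" -le 0 ]; then
            echo "Timed out waiting for: $*" >&2
            return 1
        fi
        echo "Waiting, ${remaining}s left ..."
        sleep 1
        remaining=$((remaining - 1))
    done
    return 0
}

# Usage (bridge_is_ready is a placeholder for whatever check applies):
#   wait_until_ready 60 bridge_is_ready
```

The point of such a loop is that a check failing early during boot leads to a retry instead of an immediate abort.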

For the first occurrence, see #108668#note-4 (but that ticket unfortunately mixes two completely unrelated problems, only the part about os-autoinst-openvswitch is relevant).


Files

os-autoinst-openvswitch.log (16.1 KB), mkittler, 2022-04-04 11:01

Related issues 1 (0 open, 1 closed)

Related to openQA Infrastructure (public) - action #108668: Failed systemd services alert (except openqa.suse.de) for < 60 min - Rejected - mkittler - 2022-03-21

Actions #1

Updated by okurz over 2 years ago

  • Related to action #108668: Failed systemd services alert (except openqa.suse.de) for < 60 min added
Actions #2

Updated by okurz over 2 years ago

  • Category set to Regressions/Crashes
  • Target version set to Ready
Actions #3

Updated by mkittler over 2 years ago

I cannot reproduce the behavior when replacing `system('ovs-vsctl', 'br-exists', $self->{BRIDGE});` or `system('ovs-ofctl', 'add-flow', $self->{BRIDGE}, $rule);` with `system('false');`. I get at least `"false" unexpectedly returned exit value 1 at ./os-autoinst-openvswitch line 242.`. If I replace `ip …` with `false`, the loop is entered and I get `Waiting for IP on bridge 'br0', 60s left ...`.

Actions #4

Updated by mkittler over 2 years ago

  • Status changed from New to In Progress

It was likely a leftover override:

martchus@QA-Power8-5-kvm:~> systemctl cat os-autoinst-openvswitch.service 
# /usr/lib/systemd/system/os-autoinst-openvswitch.service
# unit description file for os-autoinst openvswitch helper
# start using e.g.
# systemctl start os-autoinst-openvswitch.service
[Unit]
Description=os-autoinst openvswitch helper
BindsTo=openvswitch.service
After=openvswitch.service network.target
Before=openqa-worker.target

[Service]
Type=dbus
BusName=org.opensuse.os_autoinst.switch
Environment=OS_AUTOINST_USE_BRIDGE=br0
Environment=OS_AUTOINST_OPENVSWITCH_INIT_TIMEOUT=60
EnvironmentFile=-/etc/sysconfig/os-autoinst-openvswitch
ExecStart=/usr/lib/os-autoinst/os-autoinst-openvswitch

[Install]
WantedBy=multi-user.target

# /etc/systemd/system/os-autoinst-openvswitch.service.d/30-init-timeout.conf
[Service]
Environment="OS_AUTOINST_OPENVSWITCH_INIT_TIMEOUT=1200"

# /etc/systemd/system/os-autoinst-openvswitch.service.d/override.conf
[Service]
ExecStartPre=/bin/sh -c 'command -v wicked >/dev/null && wicked ifstatus br1 | grep -q up || wicked ifup br1'

I've removed the override on all remaining workers and will reboot one of the workers a few times to test whether it works now.
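
To make it easier to see how the `ExecStartPre` line in `override.conf` can end with the status from the journal, here is a small shell sketch. The `wicked` function is a stub, not the real binary: its output for `ifstatus` and its exit status for `ifup` are assumptions chosen to match the journal lines above.

```shell
#!/bin/sh
# Stub standing in for the real wicked binary (assumption: "ifup" prints the
# state seen in the journal and fails with the status seen in the journal).
wicked() {
    case "$1" in
        ifstatus) echo "br1             down" ;;
        ifup) echo "br1             setup-in-progress"; return 162 ;;
    esac
}

# Same list structure as the ExecStartPre line. The pipeline binds tighter
# than && and ||, which are evaluated left to right, so this reads as:
#   { command -v wicked && { wicked ifstatus br1 | grep -q up; }; } || wicked ifup br1
run_override() {
    command -v wicked >/dev/null && wicked ifstatus br1 | grep -q up || wicked ifup br1
}

status=0
run_override || status=$?
echo "exit status: $status"
```

With this stub, the `ifstatus` output is swallowed by `grep -q`, so only the `ifup` line is printed, and the status of `wicked ifup br1` becomes the status of the whole command. A failing `ExecStartPre` is enough for systemd to mark the start of the unit as failed.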

Also see https://progress.opensuse.org/issues/59300#note-18 and https://gitlab.suse.de/openqa/salt-states-openqa/-/commit/7f79991ddf91fd2f8b981b8e356a44ef099a61fc.

Actions #5

Updated by okurz over 2 years ago

  • Due date set to 2022-04-18

We looked into this together. As seen in the log, the exit code is "162", which is not a value that our scripts set. The journal also shows that the output comes from "sh". Both come from /etc/systemd/system/os-autoinst-openvswitch.service.d/override.conf, which is a leftover from https://gitlab.suse.de/openqa/salt-states-openqa/-/merge_requests/379 . In #59300#note-18 I stated that I worked on all workers that had been online and salt-controlled at that time. It could very well be that the ppc workers were not included at that time. mkittler has now fixed that manually.
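
A leftover drop-in like this can be spotted on other workers with `systemctl cat os-autoinst-openvswitch.service`, as shown above, or by listing the drop-in directory directly. The following is a minimal sketch; the function name `list_dropins` is hypothetical and the configuration directory is a parameter so it can be tried on a copy.

```shell
#!/bin/sh
# List the drop-in files of a unit below a systemd configuration directory
# (defaults to /etc/systemd/system). Prints nothing if there are none.
list_dropins() {
    unit=$1
    confdir=${2:-/etc/systemd/system}
    for f in "$confdir/$unit.d"/*.conf; do
        [ -e "$f" ] && echo "$f"
    done
    return 0
}

# Usage:
#   list_dropins os-autoinst-openvswitch.service
```

After removing an unwanted drop-in file, `systemctl daemon-reload` is needed so that systemd picks up the change.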

Actions #6

Updated by mkittler over 2 years ago

  • Status changed from In Progress to Feedback

I've restarted pw-5 and it worked. The restart took very long and the worker got completely stuck in the middle (only a reset via the BMC web interface helped; not even an IPMI power cycle did). So I don't dare to touch it again (at least for today).

Actions #8

Updated by mkittler over 2 years ago

  • Status changed from Feedback to In Progress

Ok, one more reboot :-)

Actions #9

Updated by mkittler over 2 years ago

  • Status changed from In Progress to Resolved

And it went well. So I'm closing the issue.
