action #59300

closed

auto_review:"DBus.*: The name org.opensuse.os_autoinst.switch was not provided":retry GRE tunnel settings not applied on initial setup / after reboot

Added by okurz over 4 years ago. Updated over 3 years ago.

Status:
Resolved
Priority:
Low
Assignee:
Category:
-
Target version:
Start date:
2019-11-11
Due date:
2020-10-30
% Done:

0%

Estimated time:

Description

Observation

See #32605#note-13 :
https://openqa.suse.de/tests/3516989# failed on openqaworker10 while the parallel ref was running on openqaworker8. This looks like a problem with the GRE tunnel. I can see that the config files have been correctly applied on openqaworker10 on disk, but not in the active configuration. wicked ifup br1 fixed this, as I can now see the GRE config in ovs-vsctl show. salt-states-openqa does mention wicked ifup br1 correctly, so I am not sure if this is a generic problem or a single incident caused by an incorrect application of the salt state on the worker. Probably a reboot would have fixed it just as well. I should check how the system behaves during a reboot.

Reproducible

To be confirmed, e.g. in a clean VM test install.

Problem

https://gitlab.suse.de/openqa/salt-states-openqa/blob/master/openqa/openvswitch.sls#L17 should call wicked ifup br1, but maybe this did not work when the GRE tunnel config was applied. The same or a similar problem can also happen on subsequent reboots, e.g. as reported in #59300#note-3

Suggestions

  • Apply salt states in a clean environment, e.g. new worker install or test VM
  • Call ovs-vsctl show and look if any entries for GRE remote IPs show up, e.g.
sudo ovs-vsctl show | grep -B 3 'options.*remote_ip'
        Port "gre8"
            Interface "gre8"
                type: gre
                options: {remote_ip="10.160.1.20"}
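The grep from the last suggestion can be wrapped in a small check script. This is only a sketch: the heredoc stands in for real ovs-vsctl show output so the logic can be demonstrated without Open vSwitch installed; on a worker you would use output=$(sudo ovs-vsctl show) instead.

```shell
#!/bin/sh
# Sketch: check whether any GRE remote_ip entries are present in
# `ovs-vsctl show` output. The heredoc below is sample output standing in
# for the real command; replace it with: output=$(sudo ovs-vsctl show)
output=$(cat <<'EOF'
    Port "gre8"
        Interface "gre8"
            type: gre
            options: {remote_ip="10.160.1.20"}
EOF
)
if printf '%s\n' "$output" | grep -q 'options.*remote_ip'; then
    echo "GRE tunnel configured"
else
    echo "GRE tunnel missing"
fi
```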

Workaround

Call wicked ifup br1 (or the corresponding bridge name) manually after the openvswitch config has been applied.
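A minimal sketch of this workaround as a script, assuming the bridge is named br1 (override via the BRIDGE environment variable); when wicked is not installed it only prints a hint instead of failing.

```shell
#!/bin/sh
# Workaround sketch: re-apply the bridge config, then restart the helper
# service. BRIDGE defaults to br1; adjust for your setup.
BRIDGE="${BRIDGE:-br1}"
if command -v wicked >/dev/null 2>&1; then
    wicked ifup "$BRIDGE" && systemctl restart os-autoinst-openvswitch
else
    echo "wicked not available; on a worker run: wicked ifup $BRIDGE"
fi
```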

Actions #1

Updated by okurz over 4 years ago

  • Description updated (diff)
Actions #2

Updated by okurz over 4 years ago

I just reproduced this problem on arm1 and arm3 where I added the "tap" worker class and bridge_ip to the worker settings. The salt recipes have been applied, but it still needed one additional wicked ifup br1 on both machines.

Actions #3

Updated by okurz over 4 years ago

  • Subject changed from GRE tunnel settings seems to be not applied on initial setup, wicked ifup br1 is in salt but maybe not called to GRE tunnel settings seems to be not applied on initial setup and sporadically after reboot, wicked ifup br1 is in salt but maybe not called

Problem observed by @sebchlad: "have you seen problems like this? https://openqa.suse.de/tests/3710031". Test jobs failed with "org.freedesktop.DBus.Error.ServiceUnknown: The name org.opensuse.os_autoinst.switch was not provided by any .service files" because the service os-autoinst-openvswitch on openqaworker-arm-1 failed after bootup, and we do not have any alerts to handle this. As we looked over it together, it seems that after openqaworker-arm-1 was automatically rebooted because it crashed, it needed a "wicked ifup br1 && systemctl restart os-autoinst-openvswitch". I suspect that https://github.com/os-autoinst/os-autoinst/blob/master/systemd/os-autoinst-openvswitch.service.in is missing some dependency. What service should ensure that "wicked ifup br1" really worked?

From logs on openqaworker-arm-1, journalctl -b:

Dec 13 11:04:25 openqaworker-arm-1 wickedd-nanny[1676]: device br1: failed to bind services and methods for waitDeviceReady()
Dec 13 11:04:28 openqaworker-arm-1 kernel: IPv6: ADDRCONF(NETDEV_UP): eth0: link is not ready
Dec 13 11:04:28 openqaworker-arm-1 ovs-vsctl[2136]: ovs|00001|vsctl|INFO|Called as /usr/bin/ovs-vsctl --may-exist add-port br1 tap0
Dec 13 11:04:28 openqaworker-arm-1 ovs-vsctl[2138]: ovs|00001|vsctl|INFO|Called as /usr/bin/ovs-vsctl --may-exist add-port br1 tap1
Dec 13 11:04:28 openqaworker-arm-1 kernel: nicvf 0002:01:00.1 eth0: Link is Up 10000 Mbps Full duplex
Dec 13 11:04:28 openqaworker-arm-1 kernel: IPv6: ADDRCONF(NETDEV_CHANGE): eth0: link becomes ready

It seems that, especially on arm, eth0 takes longer to come up, so the bridge is configured before eth0 is up.

I am thinking of putting a workaround into the os-autoinst-openvswitch service override file that checks the bridge and brings it up if needed, i.e. ExecStartPre=/bin/sh -c 'command -v wicked >/dev/null && wicked ifstatus br1 | grep -q up || wicked ifup br1'
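Installed as a systemd drop-in, such an ExecStartPre workaround could look like the following sketch, which writes the override file into a DESTDIR so the logic can be tried without touching /etc (on a real worker, DESTDIR would be /etc/systemd/system, followed by systemctl daemon-reload). The braces around the wicked calls are my own addition, to keep a missing wicked binary from falling through to the ifup branch.

```shell
#!/bin/sh
# Sketch: install a drop-in that brings br1 up (if needed) before
# os-autoinst-openvswitch starts. DESTDIR defaults to a temp dir for
# demonstration; use /etc/systemd/system on a real worker.
DESTDIR="${DESTDIR:-$(mktemp -d)}"
DROPIN="$DESTDIR/os-autoinst-openvswitch.service.d"
mkdir -p "$DROPIN"
cat > "$DROPIN/override.conf" <<'EOF'
[Service]
ExecStartPre=/bin/sh -c 'command -v wicked >/dev/null && { wicked ifstatus br1 | grep -q up || wicked ifup br1; }'
EOF
echo "wrote $DROPIN/override.conf"
```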

-> https://gitlab.suse.de/openqa/salt-states-openqa/merge_requests/238/

Actions #4

Updated by okurz over 4 years ago

  • Subject changed from GRE tunnel settings seems to be not applied on initial setup and sporadically after reboot, wicked ifup br1 is in salt but maybe not called to auto_review:"DBus.*: The name org.opensuse.os_autoinst.switch was not provided" GRE tunnel settings not applied on initial setup / after reboot
  • Description updated (diff)
Actions #5

Updated by okurz almost 4 years ago

  • Status changed from Workable to In Progress
  • Assignee set to okurz

Received an alert from grafana. The service failed on openqaworker3 after the most recent reboot.

-- Reboot --
Jun 14 21:50:20 linux-5h37 systemd[1]: Starting os-autoinst openvswitch helper...
Jun 14 21:50:20 linux-5h37 sh[2026]: wicked: /org/opensuse/Network/Interface.getManagedObjects failed. Server responds:
Jun 14 21:50:20 linux-5h37 sh[2026]: wicked: org.freedesktop.DBus.Error.ServiceUnknown: The name org.opensuse.Network was not provi>
Jun 14 21:50:20 linux-5h37 sh[2026]: wicked: Couldn't refresh list of active network interfaces
Jun 14 21:50:20 linux-5h37 wicked[2045]: /org/opensuse/Network/Interface.getManagedObjects failed. Server responds:
Jun 14 21:50:20 linux-5h37 wicked[2026]: /org/opensuse/Network/Interface.getManagedObjects failed. Server responds:
Jun 14 21:50:20 linux-5h37 systemd[1]: os-autoinst-openvswitch.service: Control process exited, code=exited status=1
Jun 14 21:50:20 linux-5h37 systemd[1]: Failed to start os-autoinst openvswitch helper.
Jun 14 21:50:20 linux-5h37 systemd[1]: os-autoinst-openvswitch.service: Unit entered failed state.
Jun 14 21:50:20 linux-5h37 systemd[1]: os-autoinst-openvswitch.service: Failed with result 'exit-code'.
Jun 16 11:28:18 openqaworker3 systemd[1]: Starting os-autoinst openvswitch helper...
Jun 16 11:28:19 openqaworker3 systemd[1]: Started os-autoinst openvswitch helper.

So what actually failed here seems to be the workaround, but the original service command would likely have failed shortly afterwards anyway.

I restarted the service manually and that worked, which confirms the original hypothesis. With the help of others, e.g. nsinger, we discussed the problem and could narrow it down further. More context from the system startup process:

Jun 14 21:50:19 linux-5h37 systemd[1]: Started Open vSwitch Forwarding Unit.
Jun 14 21:50:19 linux-5h37 systemd[1]: Starting Open vSwitch...
Jun 14 21:50:20 linux-5h37 SuSEfirewall2[1548]: Firewall rules set to CLOSE.
Jun 14 21:50:20 linux-5h37 systemd[1]: Started SuSEfirewall2 phase 1.
Jun 14 21:50:20 linux-5h37 systemd[1]: Started Open vSwitch.
Jun 14 21:50:20 linux-5h37 systemd[1]: Starting os-autoinst openvswitch helper...
Jun 14 21:50:20 linux-5h37 systemd[1]: Starting wicked DHCPv6 supplicant service...
Jun 14 21:50:20 linux-5h37 systemd[1]: Starting wicked AutoIPv4 supplicant service...
Jun 14 21:50:20 linux-5h37 systemd[1]: Starting wicked DHCPv4 supplicant service...
Jun 14 21:50:20 linux-5h37 systemd[1]: Starting SuSEfirewall2 phase 2...
Jun 14 21:50:20 linux-5h37 SuSEfirewall2[2030]: Setting up rules from /etc/sysconfig/SuSEfirewall2 ...
Jun 14 21:50:20 linux-5h37 SuSEfirewall2[2030]: using default zone 'ext' for interface eth0
Jun 14 21:50:20 linux-5h37 SuSEfirewall2[2030]: using default zone 'ext' for interface eth1
Jun 14 21:50:20 linux-5h37 SuSEfirewall2[2030]: using default zone 'ext' for interface eth2
Jun 14 21:50:20 linux-5h37 SuSEfirewall2[2030]: using default zone 'ext' for interface eth3
Jun 14 21:50:20 linux-5h37 dbus-daemon[1560]: [system] Activating service name='org.opensuse.Snapper' requested by ':1.3' (uid=0 pi>
Jun 14 21:50:20 linux-5h37 dbus-daemon[1560]: [system] Successfully activated service 'org.opensuse.Snapper'
Jun 14 21:50:20 linux-5h37 kernel: Netfilter messages via NETLINK v0.30.
Jun 14 21:50:20 linux-5h37 bash[1701]: File /etc/tgt/conf.d/openqa-scsi-target.conf already loaded.  Use -IncludeAgain to load it a>
Jun 14 21:50:20 linux-5h37 SuSEfirewall2[2030]: Warning: /proc/sys/net/ipv4/ip_forward is not enabled, but required for FW_ROUTE, y>
Jun 14 21:50:20 linux-5h37 SuSEfirewall2[2030]: <36>Jun 14 21:50:20 SuSEfirewall2[2030]: Warning: /proc/sys/net/ipv4/ip_forward is >
Jun 14 21:50:20 linux-5h37 systemd[1]: Started tgt admin.
Jun 14 21:50:20 linux-5h37 systemd[1]: Started wicked DHCPv4 supplicant service.
Jun 14 21:50:20 linux-5h37 sh[2026]: wicked: /org/opensuse/Network/Interface.getManagedObjects failed. Server responds:
Jun 14 21:50:20 linux-5h37 sh[2026]: wicked: org.freedesktop.DBus.Error.ServiceUnknown: The name org.opensuse.Network was not provi>
Jun 14 21:50:20 linux-5h37 sh[2026]: wicked: Couldn't refresh list of active network interfaces
Jun 14 21:50:20 linux-5h37 wicked[2045]: /org/opensuse/Network/Interface.getManagedObjects failed. Server responds:
Jun 14 21:50:20 linux-5h37 systemd[1]: Started wicked AutoIPv4 supplicant service.
Jun 14 21:50:20 linux-5h37 wicked[2045]: org.freedesktop.DBus.Error.ServiceUnknown: The name org.opensuse.Network was not provided >
Jun 14 21:50:20 linux-5h37 systemd[1]: Started wicked DHCPv6 supplicant service.
Jun 14 21:50:20 linux-5h37 wicked[2045]: Couldn't refresh list of active network interfaces
Jun 14 21:50:20 linux-5h37 systemd[1]: Starting wicked network management service daemon...
Jun 14 21:50:20 linux-5h37 wicked[2026]: /org/opensuse/Network/Interface.getManagedObjects failed. Server responds:
Jun 14 21:50:21 linux-5h37 kernel: No iBFT detected.
Jun 14 21:50:20 linux-5h37 systemd[1]: os-autoinst-openvswitch.service: Control process exited, code=exited status=1
Jun 14 21:50:20 linux-5h37 wicked[2026]: org.freedesktop.DBus.Error.ServiceUnknown: The name org.opensuse.Network was not provided >
Jun 14 21:50:20 linux-5h37 systemd[1]: Failed to start os-autoinst openvswitch helper.
Jun 14 21:50:20 linux-5h37 wicked[2026]: Couldn't refresh list of active network interfaces
Jun 14 21:50:21 linux-5h37 systemd-helper[1598]: running number cleanup for 'root'.
Jun 14 21:50:20 linux-5h37 systemd[1]: os-autoinst-openvswitch.service: Unit entered failed state.
Jun 14 21:50:20 linux-5h37 tgtd[1700]: device_mgmt(246) sz:28 params:path=/opt/openqa-iscsi-disk
Jun 14 21:50:20 linux-5h37 systemd[1]: os-autoinst-openvswitch.service: Failed with result 'exit-code'.
Jun 14 21:50:20 linux-5h37 tgtd[1700]: bs_thread_open(409) 16
Jun 14 21:50:20 linux-5h37 systemd[1]: Started wicked network management service daemon.
Jun 14 21:50:20 linux-5h37 systemd[1]: Starting wicked network nanny service...
Jun 14 21:50:20 linux-5h37 systemd[1]: Started wicked network nanny service.
Jun 14 21:50:20 linux-5h37 systemd[1]: Starting wicked managed network interfaces...
…
Jun 14 21:50:23 linux-5h37 systemd[1]: Started The Salt Minion.
Jun 14 21:50:23 linux-5h37 nscd[1694]: 1694 ignored inotify event for `/etc/resolv.conf` (file exists)
Jun 14 21:50:23 linux-5h37 nscd[1694]: 1694 ignored inotify event for `/etc/resolv.conf` (file exists)
Jun 14 21:50:24 linux-5h37 systemd-helper[1598]: running timeline cleanup for 'root'.
Jun 14 21:50:24 linux-5h37 systemd-helper[1598]: running empty-pre-post cleanup for 'root'.
Jun 14 21:50:26 linux-5h37 kernel: igb 0000:01:00.0 eth0: igb: eth0 NIC Link is Up 1000 Mbps Full Duplex, Flow Control: RX
Jun 14 21:50:26 linux-5h37 kernel: br0: port 1(eth0) entered blocking state
Jun 14 21:50:26 linux-5h37 kernel: br0: port 1(eth0) entered forwarding state
Jun 14 21:50:26 linux-5h37 kernel: IPv6: ADDRCONF(NETDEV_CHANGE): br0: link becomes ready
Jun 14 21:50:26 linux-5h37 wickedd-dhcp4[2029]: br0: Request to acquire DHCPv4 lease with UUID 7e7fe65e-ffa5-0900-7009-00003b000000
Jun 14 21:50:26 linux-5h37 wickedd-dhcp6[2027]: br0: Request to acquire DHCPv6 lease with UUID 7e7fe65e-ffa5-0900-7009-00003c000000>
Jun 14 21:50:26 linux-5h37 kernel: NET: Registered protocol family 17
Jun 14 21:50:27 linux-5h37 wickedd-dhcp4[2029]: br0: Committed DHCPv4 lease with address 10.160.0.243 (lease time 1296000 sec, rene>
Jun 14 21:50:27 linux-5h37 kernel: conntrack: generic helper won't handle protocol 47. Please consider loading the specific helper >
Jun 14 21:50:29 openqaworker3 wickedd-dhcp6[2027]: br0: Committing DHCPv6 lease with:
Jun 14 21:50:29 openqaworker3 wickedd-dhcp6[2027]: br0    +ia-na.address 2620:113:80c0:8080:10:160:0:243/0, pref-lft 10800, valid-l>
Jun 14 21:50:31 openqaworker3 /usr/sbin/start-ntpd[4792]: runtime configuration: server 10.160.0.45 10.160.0.44 10.160.255.254
Jun 14 21:50:31 openqaworker3 kernel: FS-Cache: Loaded
Jun 14 21:50:31 openqaworker3 kernel: RPC: Registered named UNIX socket transport module.
Jun 14 21:50:31 openqaworker3 kernel: RPC: Registered udp transport module.
Jun 14 21:50:31 openqaworker3 kernel: RPC: Registered tcp transport module.
Jun 14 21:50:31 openqaworker3 kernel: RPC: Registered tcp NFSv4.1 backchannel transport module.
Jun 14 21:50:31 openqaworker3 kernel: FS-Cache: Netfs 'nfs' registered for caching
Jun 14 21:50:31 openqaworker3 kernel: Key type dns_resolver registered
Jun 14 21:50:31 openqaworker3 kernel: NFS: Registering the id_resolver key type
Jun 14 21:50:31 openqaworker3 kernel: Key type id_resolver registered
Jun 14 21:50:31 openqaworker3 kernel: Key type id_legacy registered
Jun 14 21:50:32 openqaworker3 systemd[1]: Listening on RPCbind Server Activation Socket.
Jun 14 21:50:32 openqaworker3 systemd[1]: Reached target RPC Port Mapper.
Jun 14 21:50:32 openqaworker3 systemd[1]: Starting NFS status monitor for NFSv2/3 locking....
Jun 14 21:50:32 openqaworker3 systemd[1]: Starting Notify NFS peers of a restart...
Jun 14 21:50:32 openqaworker3 systemd[1]: Starting RPC Bind...
Jun 14 21:50:32 openqaworker3 sm-notify[4820]: Version 2.1.1 starting
Jun 14 21:50:32 openqaworker3 systemd[1]: Started Notify NFS peers of a restart.
Jun 14 21:50:32 openqaworker3 systemd[1]: Started RPC Bind.
Jun 14 21:50:32 openqaworker3 rpc.statd[4846]: Version 2.1.1 starting
Jun 14 21:50:32 openqaworker3 rpc.statd[4846]: Flags: TI-RPC
Jun 14 21:50:32 openqaworker3 systemd[1]: Started NFS status monitor for NFSv2/3 locking..
Jun 14 21:50:32 openqaworker3 systemd[1]: Reloading.
Jun 14 21:50:32 openqaworker3 systemd[1]: Mounted /var/lib/openqa/share.
Jun 14 21:50:32 openqaworker3 systemd[1]: Reached target Remote File Systems.
…
Jun 14 21:50:33 openqaworker3 wicked[2573]: ovs-system      up
Jun 14 21:50:33 openqaworker3 systemd[1]: Started wicked managed network interfaces.

The last message is coming from wicked.service; there is also wickedd.service.

What I could do: mask salt, telegraf and all openQA worker services on openqaworker3, remove the workaround, reboot in a loop until we reproduce the problem with a trustworthy fail ratio, then apply After=nss-lookup.target or something similar and test again. Or experiment on one of the staging worker hosts, e.g. openqaworker12.

Actions #6

Updated by okurz over 3 years ago

  • Status changed from In Progress to Feedback

So, if I artificially delay wickedd during startup, can I trigger a failure? I put in

# /etc/systemd/system/wickedd.service.d/override.conf
[Service]
ExecStart=
ExecStart=/bin/sleep 20
ExecStart=/usr/sbin/wickedd --systemd --foreground

on openqaworker12 and will try this out. This did not help; the dependency on network.target already seems to be correct. But I can reproduce the issue by preventing the network interfaces from coming up. Only try this at home when you have out-of-band access, e.g. IPMI ;) : sed -i "s/STARTMODE='auto'/STARTMODE='no'/" /etc/sysconfig/network/ifcfg-{br1,eth0} . Maybe we can wait for the specific bridge interface we rely upon? There are units like sys-devices-virtual-net-br1.device, but as the device can be configured via environment variables as well as via the environment file /etc/sysconfig/os-autoinst-openvswitch, it is probably better to use something dynamic. https://unix.stackexchange.com/a/212890 describes a nifty approach using systemd-networkd-wait-online, but before putting more magic into systemd we can just as well do sleep 1 until -e "/sys/class/net/$self->{BRIDGE}"; in Perl code. Yet another approach could be to use the restart feature of systemd services.
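The sleep-until-the-bridge-exists idea can be sketched in shell as follows. The fake sysfs tree is only there so the loop can be demonstrated without a real bridge device; on a real host SYSFS_NET would be /sys/class/net and the bridge name would come from the service environment.

```shell
#!/bin/sh
# Sketch: poll until the bridge device node appears, with a timeout.
# SYSFS_NET defaults to a temporary fake tree for demonstration; use
# /sys/class/net on a real host.
wait_for_bridge() {
    bridge="$1"
    timeout="${2:-10}"
    while [ "$timeout" -gt 0 ]; do
        if [ -e "$SYSFS_NET/$bridge" ]; then
            echo "bridge $bridge present"
            return 0
        fi
        sleep 1
        timeout=$((timeout - 1))
    done
    echo "timed out waiting for $bridge"
    return 1
}

# Demonstration against a temporary fake sysfs tree:
SYSFS_NET="${SYSFS_NET:-$(mktemp -d)}"
mkdir -p "$SYSFS_NET/br1"
wait_for_bridge br1
```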

Created https://github.com/os-autoinst/os-autoinst/pull/1454 to wait for the right network interface within os-autoinst-openvswitch directly. After that we can also remove the workaround in gitlab.suse.de/openqa/salt-states-openqa

Actions #7

Updated by okurz over 3 years ago

PR is merged, scheduled to be deployed to osd as well tomorrow. Will remove workarounds with https://gitlab.suse.de/openqa/salt-states-openqa/-/merge_requests/333

Actions #8

Updated by okurz over 3 years ago

  • Subject changed from auto_review:"DBus.*: The name org.opensuse.os_autoinst.switch was not provided" GRE tunnel settings not applied on initial setup / after reboot to auto_review:"DBus.*: The name org.opensuse.os_autoinst.switch was not provided":retry GRE tunnel settings not applied on initial setup / after reboot

https://gitlab.suse.de/openqa/auto-review/-/jobs/236944 shows 107 jobs relating to this ticket over the past three years. SQL query

select jobs.id,t_finished,state,result,test,reason,host from jobs, comments, workers where result = 'incomplete' and t_finished >= '2020-07-20' and jobs.id = comments.job_id and comments.text ~ '59300' and workers.id = assigned_worker_id;

reveals

   id    |     t_finished      | state |   result   |                   test                   |                                  reason                                  |        host        
---------+---------------------+-------+------------+------------------------------------------+--------------------------------------------------------------------------+--------------------
 4479533 | 2020-07-23 13:33:20 | done  | incomplete | hpc_BETA_mvapich2_mpi_supportserver      | backend died: missing input at /usr/lib/os-autoinst/bmwqemu.pm line 209. | openqaworker-arm-3
 4479603 | 2020-07-23 13:34:50 | done  | incomplete | hpc_BETA_mpich_mpi_slave01               | backend died: missing input at /usr/lib/os-autoinst/bmwqemu.pm line 209. | openqaworker-arm-3
 4480838 | 2020-07-23 15:35:22 | done  | incomplete | hpc_BETA_mvapich2_mpi_slave00            | backend died: missing input at /usr/lib/os-autoinst/bmwqemu.pm line 209. | openqaworker-arm-3
 4481323 | 2020-07-23 15:51:15 | done  | incomplete | hpc_pdsh_master                          | backend died: missing input at /usr/lib/os-autoinst/bmwqemu.pm line 209. | openqaworker-arm-3
 4481320 | 2020-07-23 15:51:11 | done  | incomplete | hpc_pdsh_genders_master                  | backend died: missing input at /usr/lib/os-autoinst/bmwqemu.pm line 209. | openqaworker-arm-3
 4479545 | 2020-07-23 13:34:51 | done  | incomplete | hpc_BETA_slurm_slave01                   | backend died: missing input at /usr/lib/os-autoinst/bmwqemu.pm line 209. | openqaworker-arm-3
 4480785 | 2020-07-23 15:30:02 | done  | incomplete | hpc_munge_slave                          | backend died: missing input at /usr/lib/os-autoinst/bmwqemu.pm line 209. | openqaworker-arm-3
 4480682 | 2020-07-23 15:32:20 | done  | incomplete | hpc_ALPHA_slurm_master                   | backend died: missing input at /usr/lib/os-autoinst/bmwqemu.pm line 209. | openqaworker-arm-3
 4480643 | 2020-07-23 15:33:46 | done  | incomplete | hpc_mrsh_supportserver                   | backend died: missing input at /usr/lib/os-autoinst/bmwqemu.pm line 209. | openqaworker-arm-3
 4481277 | 2020-07-23 15:49:59 | done  | incomplete | hpc_ALPHA_slurm_supportserver            | backend died: missing input at /usr/lib/os-autoinst/bmwqemu.pm line 209. | openqaworker-arm-3
 4481283 | 2020-07-23 15:50:01 | done  | incomplete | hpc_ALPHA_openmpi_mpi_master             | backend died: missing input at /usr/lib/os-autoinst/bmwqemu.pm line 209. | openqaworker-arm-3
 4481315 | 2020-07-23 15:51:16 | done  | incomplete | hpc_mrsh_slave                           | backend died: missing input at /usr/lib/os-autoinst/bmwqemu.pm line 209. | openqaworker-arm-3
 4479547 | 2020-07-23 13:34:52 | done  | incomplete | hpc_DELTA_slurm_accounting_supportserver | backend died: missing input at /usr/lib/os-autoinst/bmwqemu.pm line 209. | openqaworker-arm-3
 4479570 | 2020-07-23 13:32:11 | done  | incomplete | hpc_pdsh_genders_slave                   | backend died: missing input at /usr/lib/os-autoinst/bmwqemu.pm line 209. | openqaworker-arm-3
 4479619 | 2020-07-23 13:36:20 | done  | incomplete | hpc_GAMMA_slurm_db                       | backend died: missing input at /usr/lib/os-autoinst/bmwqemu.pm line 209. | openqaworker-arm-3
 4480813 | 2020-07-23 15:35:18 | done  | incomplete | hpc_DELTA_slurm_slave01                  | backend died: missing input at /usr/lib/os-autoinst/bmwqemu.pm line 209. | openqaworker-arm-3
 4481234 | 2020-07-23 15:49:08 | done  | incomplete | hpc_BETA_mvapich2_mpi_master             | backend died: missing input at /usr/lib/os-autoinst/bmwqemu.pm line 209. | openqaworker-arm-3
 4481314 | 2020-07-23 15:51:15 | done  | incomplete | hpc_mrsh_master                          | backend died: missing input at /usr/lib/os-autoinst/bmwqemu.pm line 209. | openqaworker-arm-3
 4481348 | 2020-07-23 15:53:22 | done  | incomplete | hpc_BETA_mpich_mpi_slave00               | backend died: missing input at /usr/lib/os-autoinst/bmwqemu.pm line 209. | openqaworker-arm-3
 4479526 | 2020-07-23 13:31:52 | done  | incomplete | hpc_ALPHA_mpich_mpi_supportserver        | backend died: missing input at /usr/lib/os-autoinst/bmwqemu.pm line 209. | openqaworker-arm-3
 4481351 | 2020-07-23 15:53:32 | done  | incomplete | hpc_BETA_mvapich2_mpi_master             | backend died: missing input at /usr/lib/os-autoinst/bmwqemu.pm line 209. | openqaworker-arm-3
 4479501 | 2020-07-23 13:31:53 | done  | incomplete | hpc_pdsh_supportserver                   | backend died: missing input at /usr/lib/os-autoinst/bmwqemu.pm line 209. | openqaworker-arm-3
 4479617 | 2020-07-23 13:34:53 | done  | incomplete | hpc_BETA_openmpi_mpi_slave01             | backend died: missing input at /usr/lib/os-autoinst/bmwqemu.pm line 209. | openqaworker-arm-3
 4479532 | 2020-07-23 13:33:46 | done  | incomplete | hpc_ALPHA_slurm_slave_03                 | backend died: missing input at /usr/lib/os-autoinst/bmwqemu.pm line 209. | openqaworker-arm-3
 4481219 | 2020-07-23 15:48:53 | done  | incomplete | hpc_ALPHA_mvapich2_mpi_slave01           | backend died: missing input at /usr/lib/os-autoinst/bmwqemu.pm line 209. | openqaworker-arm-3
 4481486 | 2020-07-23 17:33:51 | done  | incomplete | hpc_BETA_mpich_mpi_supportserver         | backend died: missing input at /usr/lib/os-autoinst/bmwqemu.pm line 209. | openqaworker-arm-3

and more, but only on openqaworker-arm-3. On that machine the service os-autoinst-openvswitch.service has failed. Failed services as reported on https://stats.openqa-monitor.qa.suse.de/d/KToPYLEWz/failed-systemd-services have been ignored for more than three days now :(

okurz@openqaworker-arm-3:~> sudo journalctl -b -u os-autoinst-openvswitch
-- Logs begin at Wed 2020-01-08 11:03:01 CET, end at Sat 2020-07-25 13:48:37 CEST. --
Jul 22 08:01:01 openqaworker-arm-3 os-autoinst-openvswitch[5813]: can't parse bridge local port IP at /usr/lib/os-autoinst/os-autoinst-openvswitch line 42.
Jul 22 08:00:36 openqaworker-arm-3 systemd[1]: Starting os-autoinst openvswitch helper...
Jul 22 08:00:38 openqaworker-arm-3 systemd[1]: os-autoinst-openvswitch.service: Main process exited, code=exited, status=255/n/a
Jul 22 08:00:38 openqaworker-arm-3 systemd[1]: Failed to start os-autoinst openvswitch helper.
Jul 22 08:00:38 openqaworker-arm-3 systemd[1]: os-autoinst-openvswitch.service: Unit entered failed state.
Jul 22 08:00:38 openqaworker-arm-3 systemd[1]: os-autoinst-openvswitch.service: Failed with result 'exit-code'.

The problem with reproduction is of course still that simply restarting the service helps. After a reboot the service also came up correctly. Created a revert of the workaround removal with https://gitlab.suse.de/openqa/salt-states-openqa/-/merge_requests/341 and https://github.com/os-autoinst/os-autoinst/pull/1486 for a fix.

Actions #9

Updated by okurz over 3 years ago

  • Target version set to Ready
Actions #10

Updated by okurz over 3 years ago

  • Due date set to 2020-08-04
Actions #11

Updated by okurz over 3 years ago

  • Due date changed from 2020-08-04 to 2020-08-27

PR merged and active on o3:

for i in aarch64 openqaworker1 openqaworker4 openqaworker7 power8 imagetester rebel; do echo $i && ssh root@$i "systemctl is-active os-autoinst-openvswitch; journalctl --since=2020-08-01 -u os-autoinst-opensvswitch | grep 'Waiting for IP on bridge'" ; done

so the service is active on aarch64, openqaworker1 and openqaworker7 as expected, but the waiting line does not appear anywhere. Then again, we never saw this problem on o3 anyway. Would need to check within the OSD infrastructure, but the osd deployment is currently blocked by #69613, so checking later.

Actions #12

Updated by okurz over 3 years ago

deployed on osd and can be checked with sudo salt -l error -C 'G@roles:worker' cmd.run 'systemctl is-active os-autoinst-openvswitch; journalctl --since=2020-08-06 -u os-autoinst-opensvswitch | grep "Waiting for IP on bridge"' but should be checked again after multiple reboots of OSD workers.

Actions #13

Updated by livdywan over 3 years ago

okurz wrote:

deployed on osd and can be checked with sudo salt -l error -C 'G@roles:worker' cmd.run 'systemctl is-active os-autoinst-openvswitch; journalctl --since=2020-08-06 -u os-autoinst-opensvswitch | grep "Waiting for IP on bridge"' but should be checked again after multiple reboots of OSD workers.

openqaworker9.suse.de:
    active
openqaworker8.suse.de:
    active
grenache-1.qa.suse.de:
    active
openqaworker2.suse.de:
    active
malbec.arch.suse.de:
    active
QA-Power8-4-kvm.qa.suse.de:
    active
openqaworker6.suse.de:
    active
openqaworker5.suse.de:
    active
QA-Power8-5-kvm.qa.suse.de:
    active
openqaworker-arm-3.suse.de:
    active
openqaworker10.suse.de:
    active
openqaworker13.suse.de:
    active
openqaworker-arm-1.suse.de:
    active
Actions #14

Updated by okurz over 3 years ago

  • Due date changed from 2020-08-27 to 2020-10-14

thx. So with this we should not need the workaround in salt anymore, prepared https://gitlab.suse.de/openqa/salt-states-openqa/-/merge_requests/351

Actions #15

Updated by okurz over 3 years ago

  • Priority changed from Normal to Low
Actions #17

Updated by okurz over 3 years ago

  • Status changed from Feedback to Resolved

sudo salt -l error -C 'G@roles:worker' cmd.run 'systemctl is-active os-autoinst-openvswitch; journalctl --since=2020-09-07 -u os-autoinst-opensvswitch | grep "Waiting for IP on bridge"' returned no matches. So the problematic step was not encountered, but no problem appeared either. Assuming that the fix applied in the scripts is effective, I call this done.

Actions #18

Updated by okurz over 3 years ago

  • Due date changed from 2020-10-14 to 2020-10-30
  • Status changed from Resolved to Feedback

Seems I was getting pretty confused: instead of removing the workaround I re-added it. Removing with https://gitlab.suse.de/openqa/salt-states-openqa/-/merge_requests/379

On osd workers I also did sudo salt -l error -C 'G@roles:worker' cmd.run 'systemctl revert os-autoinst-openvswitch.service'

Actions #19

Updated by okurz over 3 years ago

  • Status changed from Feedback to Resolved

The command from #59300#note-17 did not return any matches. Maybe this time I am actually done :)
