Project

General

Profile

action #106594

[tools] openqaworker-arm-3 periodically fails os-autoinst-openvswitch service

Added by osukup 5 months ago. Updated 5 months ago.

Status:
Resolved
Priority:
Urgent
Assignee:
Target version:
Start date:
2022-02-10
Due date:
% Done:

0%

Estimated time:

Description

Observation

from journalctl -xe -u os-autoinst-openvswitch

úno 09 21:56:21 openqaworker-arm-3 os-autoinst-openvswitch[2924]: Waiting for IP on bridge 'br1', 300s left ...
úno 09 21:56:22 openqaworker-arm-3 os-autoinst-openvswitch[2924]: Waiting for IP on bridge 'br1', 299s left ...
....
úno 09 22:01:20 openqaworker-arm-3 os-autoinst-openvswitch[2924]: Waiting for IP on bridge 'br1', 3s left ...
úno 09 22:01:21 openqaworker-arm-3 os-autoinst-openvswitch[2924]: Waiting for IP on bridge 'br1', 2s left ...
úno 09 22:01:22 openqaworker-arm-3 os-autoinst-openvswitch[2924]: can't parse bridge local port IP at /usr/lib/os-autoinst/os-autoinst-openvswitch line 43.
úno 09 22:01:22 openqaworker-arm-3 os-autoinst-openvswitch[2924]: Waiting for IP on bridge 'br1', 1s left ...
úno 09 22:01:22 openqaworker-arm-3 systemd[1]: os-autoinst-openvswitch.service: Main process exited, code=exited, status=255/EXCEPTION

Default timeout is 60 seconds, on openqaworker-arm-3 is now 5 minutes, but still isn't enough after system reboot

Rollback steps

  • Unpause alert "Failed systemd services alert (except openqa.suse.de)"systemd services (

History

#1 Updated by okurz 5 months ago

  • Subject changed from [qe-tools] openqaworker-arm-3 perodicaly fails os-autoinst-openvswitch service to [tools] openqaworker-arm-3 periodically fails os-autoinst-openvswitch service
  • Target version set to Ready

#2 Updated by okurz 5 months ago

  • Status changed from New to In Progress
  • Assignee set to okurz
  • Priority changed from Normal to Urgent

This is still triggering recurring alerts. Please pause alerts referencing this ticket, note down as rollback steps in the ticket description and then we should look into solving the problem. Just saw two alerts again this morning. journalctl -b shows what could be related:

Feb 11 06:40:40 openqaworker-arm-3 chown[1902]: /usr/bin/chown: cannot access '/run/openvswitch': No such file or directory
Feb 11 06:40:40 openqaworker-arm-3 systemd-udevd[1231]: Network interface NamePolicy= disabled by default.
Feb 11 06:40:40 openqaworker-arm-3 systemd[1]: Started wicked DHCPv6 supplicant service.
Feb 11 06:40:40 openqaworker-arm-3 systemd[1]: Started wicked DHCPv4 supplicant service.
…
Feb 11 06:40:41 openqaworker-arm-3 ovs-ctl[1908]: Starting ovsdb-server..done
Feb 11 06:40:41 openqaworker-arm-3 ovs-vsctl[1982]: ovs|00001|vsctl|INFO|Called as ovs-vsctl --no-wait -- init -- set Open_vSwitch . db-version=8.2.0
Feb 11 06:40:41 openqaworker-arm-3 ovs-vsctl[1987]: ovs|00001|vsctl|INFO|Called as ovs-vsctl --no-wait set Open_vSwitch . ovs-version=2.14.2 "external-ids:system-id=\"c8f61284-c5d2-4bce-ad54-36e3ee4b5712\"" "external-ids:rundir=\"/run/openvswitch\"" "system-type=\"opensuse-leap\"" "system-version=\"15.3\""
Feb 11 06:40:41 openqaworker-arm-3 ovs-ctl[1908]: Configuring Open vSwitch system IDs..done
Feb 11 06:40:41 openqaworker-arm-3 ovs-ctl[1990]: hostname: Name or service not known
Feb 11 06:40:41 openqaworker-arm-3 ovs-ctl[1908]: Enabling remote OVSDB managers..done
Feb 11 06:40:41 openqaworker-arm-3 ovs-vsctl[1994]: ovs|00001|vsctl|INFO|Called as ovs-vsctl --no-wait add Open_vSwitch . external-ids hostname=openqaworker-arm-3
Feb 11 06:40:41 openqaworker-arm-3 systemd[1]: Started Open vSwitch Database Unit.
Feb 11 06:40:41 openqaworker-arm-3 systemd[1]: Starting Open vSwitch Delete Transient Ports...
Feb 11 06:40:41 openqaworker-arm-3 ModemManager[1869]: <info>  Couldn't check support for device '/sys/devices/platform/soc@0/84a000000000.pci/pci0002:00/0002:00:02.0/0002:01:00.1': not supported by any plugin
Feb 11 06:40:41 openqaworker-arm-3 ModemManager[1869]: <info>  Couldn't check support for device '/sys/devices/platform/soc@0/84a000000000.pci/pci0002:00/0002:00:02.0/0002:01:00.2': not supported by any plugin
Feb 11 06:40:41 openqaworker-arm-3 ModemManager[1869]: <info>  Couldn't check support for device '/sys/devices/platform/soc@0/84a000000000.pci/pci0002:00/0002:00:02.0/0002:01:00.3': not supported by any plugin
Feb 11 06:40:41 openqaworker-arm-3 ModemManager[1869]: <info>  Couldn't check support for device '/sys/devices/platform/soc@0/84a000000000.pci/pci0002:00/0002:00:02.0/0002:01:00.4': not supported by any plugin
Feb 11 06:40:41 openqaworker-arm-3 ModemManager[1869]: <info>  Couldn't check support for device '/sys/devices/platform/soc@0/84a000000000.pci/pci0002:00/0002:00:02.0/0002:01:00.5': not supported by any plugin
Feb 11 06:40:41 openqaworker-arm-3 systemd[1]: Finished Open vSwitch Delete Transient Ports.
Feb 11 06:40:41 openqaworker-arm-3 systemd[1]: Starting Open vSwitch Forwarding Unit...
Feb 11 06:40:41 openqaworker-arm-3 ovs-ctl[2051]: Inserting openvswitch module..done
Feb 11 06:40:41 openqaworker-arm-3 kernel: openvswitch: Open vSwitch switching datapath
Feb 11 06:40:41 openqaworker-arm-3 systemd-udevd[1961]: Using default interface naming scheme 'v238'.
Feb 11 06:40:41 openqaworker-arm-3 systemd-udevd[1961]: ethtool: autonegotiation is unset or enabled, the speed and duplex are not writable.
Feb 11 06:40:41 openqaworker-arm-3 systemd-udevd[1961]: ovs-system: Could not generate persistent MAC: No data available
Feb 11 06:40:41 openqaworker-arm-3 kernel: device ovs-system entered promiscuous mode
Feb 11 06:40:41 openqaworker-arm-3 kernel: Timeout policy base is empty
Feb 11 06:40:41 openqaworker-arm-3 kernel: Failed to associated timeout policy `ovs_test_tp'
Feb 11 06:40:42 openqaworker-arm-3 kernel: gre: GRE over IPv4 demultiplexor driver
Feb 11 06:40:42 openqaworker-arm-3 systemd-udevd[1961]: ethtool: autonegotiation is unset or enabled, the speed and duplex are not writable.
Feb 11 06:40:42 openqaworker-arm-3 kernel: ip_gre: GRE over IPv4 tunneling driver
Feb 11 06:40:42 openqaworker-arm-3 systemd-udevd[1962]: ethtool: autonegotiation is unset or enabled, the speed and duplex are not writable.
Feb 11 06:40:42 openqaworker-arm-3 systemd-udevd[1964]: ethtool: autonegotiation is unset or enabled, the speed and duplex are not writable.
Feb 11 06:40:42 openqaworker-arm-3 systemd-udevd[2188]: ethtool: autonegotiation is unset or enabled, the speed and duplex are not writable.
Feb 11 06:40:42 openqaworker-arm-3 kernel: device gre_sys entered promiscuous mode
Feb 11 06:40:42 openqaworker-arm-3 systemd-udevd[2188]: Using default interface naming scheme 'v238'.
Feb 11 06:40:42 openqaworker-arm-3 systemd-udevd[2188]: gre_sys: Could not generate persistent MAC: No data available
Feb 11 06:40:42 openqaworker-arm-3 kernel: device br1 entered promiscuous mode
Feb 11 06:40:42 openqaworker-arm-3 systemd-udevd[1961]: ethtool: autonegotiation is unset or enabled, the speed and duplex are not writable.
Feb 11 06:40:42 openqaworker-arm-3 systemd-udevd[1961]: br1: Could not generate persistent MAC: No data available
Feb 11 06:40:43 openqaworker-arm-3 ovs-ctl[2009]: Starting ovs-vswitchd..done
Feb 11 06:40:43 openqaworker-arm-3 ovs-ctl[2393]: hostname: Name or service not known
Feb 11 06:40:43 openqaworker-arm-3 ovs-ctl[2009]: Enabling remote OVSDB managers..done
Feb 11 06:40:43 openqaworker-arm-3 systemd[1]: Started Open vSwitch Forwarding Unit.
Feb 11 06:40:43 openqaworker-arm-3 systemd[1]: Starting Open vSwitch...
Feb 11 06:40:43 openqaworker-arm-3 ovs-vsctl[2397]: ovs|00001|vsctl|INFO|Called as ovs-vsctl --no-wait add Open_vSwitch . external-ids hostname=openqaworker-arm-3
Feb 11 06:40:43 openqaworker-arm-3 systemd[1]: Finished Open vSwitch.
Feb 11 06:40:43 openqaworker-arm-3 systemd[1]: Starting wicked network management service daemon...
Feb 11 06:40:43 openqaworker-arm-3 systemd[1]: Listening on Load/Save RF Kill Switch Status /dev/rfkill Watch.
Feb 11 06:40:43 openqaworker-arm-3 systemd[1]: Started wicked network management service daemon.
Feb 11 06:40:43 openqaworker-arm-3 systemd[1]: Starting wicked network nanny service...
Feb 11 06:40:43 openqaworker-arm-3 systemd[1]: Started wicked network nanny service.
Feb 11 06:40:43 openqaworker-arm-3 systemd[1]: Starting wicked managed network interfaces...
…
Feb 11 06:40:44 openqaworker-arm-3 systemd-udevd[1964]: Using default interface naming scheme 'v238'.
Feb 11 06:40:44 openqaworker-arm-3 systemd-udevd[1964]: ethtool: autonegotiation is unset or enabled, the speed and duplex are not writable.
Feb 11 06:40:44 openqaworker-arm-3 systemd-udevd[1964]: tap100: Could not generate persistent MAC: No data available
Feb 11 06:40:44 openqaworker-arm-3 systemd-udevd[1961]: ethtool: autonegotiation is unset or enabled, the speed and duplex are not writable.
Feb 11 06:40:44 openqaworker-arm-3 systemd-udevd[1961]: tap101: Could not generate persistent MAC: No data available
Feb 11 06:40:44 openqaworker-arm-3 systemd-udevd[1962]: Using default interface naming scheme 'v238'.
… (a lot of text over 8 minutes where it looks like nobody actually asked for an address, and then this vvv )
Feb 11 06:48:59 openqaworker-arm-3 wickedd-dhcp4[1899]: eth1: Request to acquire DHCPv4 lease with UUID bdf80562-4ffc-0b00-5f09-0000ba000000

and then everything looks good. So waiting for longer would likely help but sounds like we have a severe network misconfiguration problem or wicked bug.

I bumped the timeout in openqaworker-arm-3:/etc/systemd/system/os-autoinst-openvswitch.service.d/30-init-timeout.conf from 300s (5m) to 1200s (15m) and triggered a reboot and created the according salt repo change in https://gitlab.suse.de/openqa/salt-states-openqa/-/merge_requests/649

I paused the alert "Failed systemd services alert (except openqa.suse.de)"

#3 Updated by okurz 5 months ago

  • Description updated (diff)

#4 Updated by okurz 5 months ago

  • Status changed from In Progress to Resolved

I unpaused https://stats.openqa-monitor.qa.suse.de/d/KToPYLEWz/failed-systemd-services?orgId=1&editPanel=6&tab=alert again after checking that it wasn't triggering the past hours. MR https://gitlab.suse.de/openqa/salt-states-openqa/-/merge_requests/649 merged and effective on openqaworker-arm-3 though likely not rolled out yet on other machines. That should be ok because the problem appeared only on one machine so far. Yes, there likely is still a bigger misconfiguration or broken network but with our experiences we can likely not expect much help from others and also the labs are moving, different building, rooms, different network, yada yada.

Also available in: Atom PDF