action #162284

closed

Prevent multi-machine tests to be picked up if os-autoinst-openvswitch service does not work size:M

Added by okurz 6 months ago. Updated 3 months ago.

Status: Resolved
Priority: Normal
Assignee:
Category: Feature requests
Target version:
Start date: 2024-06-14
Due date:
% Done: 0%
Estimated time:
Description

Observation

During the unintended upgrade of worker31 and others to Leap 15.6, the network only came up after 20 minutes(!) for as yet unknown reasons, see #157975. Additional problems arose because os-autoinst-openvswitch eventually timed out, yet openQA workers still happily picked up jobs and destroyed them with

backend died: Open vSwitch command 'set_vlan' with arguments 'tap37 121' failed: org.freedesktop.DBus.Error.ServiceUnknown: The name org.opensuse.os_autoinst.switch was not provided by any .service files

Fabian Vogt suggests: "Instead of relying on the systemd .service to be enabled and running you could add a dbus .service file with a SystemdService= key to make use of dbus autolaunch"
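
To make the suggestion above concrete, here is a minimal sketch of what such a D-Bus activation file might contain, rendered with Python's configparser so the result can be inspected. The bus name is taken from the error message above. The unit name os-autoinst-openvswitch.service, the User value, and the /bin/false placeholder for Exec are assumptions, as is the usual install location for system bus activation files (/usr/share/dbus-1/system-services/). This is not the file the project ships.

```python
import configparser
import io

# The bus name appears in the error message above; the unit name is an
# assumption for illustration.
BUS_NAME = "org.opensuse.os_autoinst.switch"
SYSTEMD_UNIT = "os-autoinst-openvswitch.service"


def render_dbus_activation_file(bus_name: str, systemd_unit: str) -> str:
    """Return the text of a D-Bus activation file that defers to systemd."""
    parser = configparser.ConfigParser()
    parser.optionxform = str  # keep key case; D-Bus keys are case-sensitive
    parser["D-BUS Service"] = {
        "Name": bus_name,
        # Exec is a required key; when systemd activation is in use the bus
        # daemon asks systemd to start the unit instead of running Exec.
        "Exec": "/bin/false",
        "User": "root",  # assumption: the service runs as root
        "SystemdService": systemd_unit,
    }
    buffer = io.StringIO()
    parser.write(buffer, space_around_delimiters=False)
    return buffer.getvalue()


activation_file = render_dbus_activation_file(BUS_NAME, SYSTEMD_UNIT)
print(activation_file)
```

With a file like this installed, a call to the bus name would trigger a start of the unit rather than fail immediately with ServiceUnknown. It would still depend on systemd being able to start the service.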

Acceptance criteria

  • AC1: openQA workers do not pick up multi-machine tests if the os-autoinst-openvswitch service is not available over D-Bus
  • AC2: openQA workers clearly communicate this problematic situation

Suggestions

  • If the worker has a multi-machine worker class, make it check whether the required D-Bus service is available; if not, make the worker show up as "broken", a state we already use in other cases
  • As an alternative, look into the suggestion by fvogt, although that would still rely on systemd
  • As an alternative to declaring the worker "broken", remove the "tap" class dynamically or change it to "tap_broken_$reason"
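
The suggestions above could be sketched roughly as follows. This is an illustration in Python, not the openQA worker code. The function names, the MM_CLASSES set, and the "_broken_dbus" suffix are hypothetical. The only external call is the standard org.freedesktop.DBus.NameHasOwner method, queried via dbus-send. The availability check is passed in as a callable so the decision logic can be exercised without a running bus.

```python
import subprocess
from typing import Callable, List, Optional, Tuple

BUS_NAME = "org.opensuse.os_autoinst.switch"  # from the error message above
MM_CLASSES = {"tap"}  # assumption: classes that mark a multi-machine worker


def bus_name_has_owner(name: str = BUS_NAME) -> bool:
    """Ask the system bus daemon whether `name` currently has an owner."""
    cmd = [
        "dbus-send", "--system", "--print-reply",
        "--dest=org.freedesktop.DBus", "/org/freedesktop/DBus",
        "org.freedesktop.DBus.NameHasOwner", f"string:{name}",
    ]
    try:
        result = subprocess.run(cmd, capture_output=True, text=True, timeout=10)
    except (OSError, subprocess.TimeoutExpired):
        return False  # no dbus-send or no answer: treat as unavailable
    return result.returncode == 0 and "boolean true" in result.stdout


def evaluate_worker(
    worker_classes: List[str],
    service_available: Callable[[], bool] = bus_name_has_owner,
) -> Tuple[List[str], Optional[str]]:
    """Return (classes to advertise, reason the worker is broken or None)."""
    if not MM_CLASSES.intersection(worker_classes):
        return list(worker_classes), None  # not a multi-machine worker
    if service_available():
        return list(worker_classes), None
    reason = f"D-Bus service {BUS_NAME} is not available"
    # Alternative to a "broken" state: rename the multi-machine classes so
    # the scheduler no longer matches multi-machine jobs to this worker.
    degraded = [
        f"{cls}_broken_dbus" if cls in MM_CLASSES else cls
        for cls in worker_classes
    ]
    return degraded, reason
```

A caller would pick one of the two outcomes: report the worker as "broken" with the returned reason (AC2), or keep it online with the degraded class list so it still accepts single-machine jobs.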

Related issues 3 (1 open, 2 closed)

Related to openQA Tests (public) - action #165132: test fails in openqa_worker with 'No such timeout policy "ovs_test_tp"' and other problems regarding move to scripts/ (Resolved, okurz, 2024-08-12)

Related to openQA Infrastructure (public) - action #165860: wicked_basic_ref fails on ppc64le: Open vSwitch command 'set_vlan' with arguments 'tap3 8' failed (Resolved, mkittler, 2024-08-27)

Copied from openQA Infrastructure (public) - action #157975: Upgrade osd workers to openSUSE Leap 15.6 size:S (In Progress, ybonatakis, 2024-12-26)

Actions #1

Updated by okurz 6 months ago

  • Copied from action #157975: Upgrade osd workers to openSUSE Leap 15.6 size:S added
Actions #3

Updated by okurz 4 months ago

  • Related to action #165132: test fails in openqa_worker with 'No such timeout policy "ovs_test_tp"' and other problems regarding move to scripts/ added
Actions #4

Updated by tinita 4 months ago

  • Subject changed from Prevent multi-machine tests to be picked up if os-autoinst-openvswitch service does not work to Prevent multi-machine tests to be picked up if os-autoinst-openvswitch service does not work size:M
  • Description updated (diff)
  • Status changed from New to Workable
Actions #5

Updated by tinita 4 months ago

  • Target version changed from Tools - Next to Ready
Actions #6

Updated by livdywan 3 months ago

  • Description updated (diff)
Actions #7

Updated by livdywan 3 months ago

  • Related to action #165860: wicked_basic_ref fails on ppc64le: Open vSwitch command 'set_vlan' with arguments 'tap3 8' failed added
Actions #8

Updated by mkittler 3 months ago

  • Status changed from Workable to In Progress
  • Assignee set to mkittler
Actions #9

Updated by mkittler 3 months ago

  • Status changed from In Progress to Feedback
Actions #10

Updated by mkittler 3 months ago

  • Status changed from Feedback to Resolved

The PR has been merged. Besides the unit tests, I did a local "fullstack" test of the feature, so I think it will work in production when os-autoinst-openvswitch is unavailable. I also checked for false positives on o3 (where the change is already deployed) and found no related broken workers.
