action #167081

test fails in support_server/setup on osd worker37 size:S

Added by acarvajal 2 months ago. Updated about 2 months ago.

Status: Resolved
Priority: Normal
Assignee: mkittler
Category: Regressions/Crashes
Target version: Ready
Start date: 2024-09-19
Due date: 2024-10-09
% Done: 0%
Estimated time:
Description

Observation

openQA test in scenario sle-12-SP5-Server-DVD-HA-Incidents-x86_64-qam_ha_hawk_supportserver@64bit fails in setup.
Connections to 10.0.2.2 fail.

So far, failures have only been observed on worker37.

List of failed jobs:
https://openqa.suse.de/tests/15475749#step/setup/78
https://openqa.suse.de/tests/15468640#step/setup/73
https://openqa.suse.de/tests/15468646#step/setup/86
https://openqa.suse.de/tests/15468610#step/setup/86
https://openqa.suse.de/tests/15478397
https://openqa.suse.de/tests/15478325
https://openqa.suse.de/tests/15478676

Reproducible

Fails since (at least) Build :35702:grep (current job)

Expected result

Last good: :35691:xerces-c (or more recent)

Suggestions

  • Check what the status of the ovs bridge is on that worker (see the command sketch after this list)
  • Bring worker37 back into production after verification
  • Understand why openQA multi-machine jobs can still fail without us being alerted when prerequisites on the machines are not fulfilled
  • It should be safe to assume that the services (openvswitch and os-autoinst-openvswitch) work as per #162284
  • Have a look at https://open.qa/docs/#_debugging_open_vswitch_configuration and related sections
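
As a starting point for the first suggestion, a rough sketch of how the bridge and the involved services could be inspected on the worker (br1 is the bridge name that shows up later in this ticket):

# show the Open vSwitch topology (bridges, ports, tunnels)
sudo ovs-vsctl show
# check that the relevant services are running
sudo systemctl status openvswitch.service os-autoinst-openvswitch.service
# check which firewalld zone the bridge is currently assigned to
sudo firewall-cmd --get-zone-of-interface=br1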

Further details

Always latest result in this scenario: latest


Related issues: 1 (1 open, 0 closed)

Related to openQA Infrastructure - action #166802: Recover worker37, worker38, worker39 size:S (Blocked, assigned to okurz)

Actions #1

Updated by nicksinger 2 months ago

  • Related to action #166802: Recover worker37, worker38, worker39 size:S added
Actions #2

Updated by nicksinger 2 months ago

I've removed the worker from production and added a comment here: https://progress.opensuse.org/issues/166802#note-11

Actions #3

Updated by okurz 2 months ago

  • Category set to Regressions/Crashes
  • Target version set to Ready
Actions #4

Updated by dheidler about 2 months ago

  • Subject changed from test fails in support_server/setup to test fails in support_server/setup on osd worker37 size:S
  • Description updated (diff)
  • Status changed from New to Workable
Actions #5

Updated by mkittler about 2 months ago

  • Status changed from Workable to In Progress
  • Assignee set to mkittler
Actions #6

Updated by mkittler about 2 months ago

It looks like the runtime firewall config was wrong because the interfaces were not in the correct zone. I fixed that now. I'm wondering why that happened, though, since the permanent config looks good.
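
For reference, a minimal sketch of the kind of runtime-only adjustment this amounts to (assuming br1 is the interface that had to be moved into the trusted zone; without --permanent only the runtime configuration is changed):

# show which interfaces currently sit in which zone (runtime view)
sudo firewall-cmd --get-active-zones
# move the bridge into the trusted zone for the running config only
sudo firewall-cmd --zone=trusted --change-interface=br1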

Actions #7

Updated by openqa_review about 2 months ago

  • Due date set to 2024-10-09

Setting due date based on mean cycle time of SUSE QE Tools

Actions #8

Updated by mkittler about 2 months ago

I cloned some test jobs and they succeeded:

openqa-clone-job --skip-chained-deps --within-instance https://openqa.suse.de/tests/15493582 _GROUP=0 {TEST,BUILD}+=-poo167081 WORKER_CLASS=worker37
Cloning parents of sle-15-SP7-Online-x86_64-Build21.2-ping_server@64bit
Cloning children of sle-15-SP7-Online-x86_64-Build21.2-ping_server@64bit
Cloning parents of sle-15-SP7-Online-x86_64-Build21.2-ping_client@64bit
2 jobs have been created:
 - sle-15-SP7-Online-x86_64-Build21.2-ping_server@64bit -> https://openqa.suse.de/tests/15528008
 - sle-15-SP7-Online-x86_64-Build21.2-ping_client@64bit -> https://openqa.suse.de/tests/15528009

I'll reboot the worker to see whether the firewall config is still correct then (even though the persistent config looked good anyway).
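
One way to check after the reboot whether the runtime and permanent views agree (br1 again being the bridge on this worker):

# runtime view: which zone the bridge actually ended up in
sudo firewall-cmd --get-zone-of-interface=br1
# permanent view: which interfaces the trusted zone is supposed to contain
sudo firewall-cmd --permanent --zone=trusted --list-interfaces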

Actions #9

Updated by mkittler about 2 months ago

It still works after a reboot. So I suppose nothing was misconfigured persistently and probably a reboot alone would have helped, too.

I also cloned one of the more complicated scenarios mentioned in the ticket description:

openqa-clone-job --skip-chained-deps --within-instance https://openqa.suse.de/tests/15499110 _GROUP=0 {TEST,BUILD}+=-poo167081 WORKER_CLASS=worker37
Cloning parents of sle-15-SP4-Server-DVD-HA-Incidents-x86_64-Build:35759:python-pyzmq-qam_3nodes_supportserver@64bit
Cloning children of sle-15-SP4-Server-DVD-HA-Incidents-x86_64-Build:35759:python-pyzmq-qam_3nodes_supportserver@64bit
Cloning parents of sle-15-SP4-Server-DVD-HA-Incidents-x86_64-Build:35759:python-pyzmq-qam_3nodes_client@64bit
Cloning parents of sle-15-SP4-Server-DVD-HA-Incidents-x86_64-Build:35759:python-pyzmq-qam_3nodes_node01@64bit
Cloning parents of sle-15-SP4-Server-DVD-HA-Incidents-x86_64-Build:35759:python-pyzmq-qam_3nodes_node02@64bit
Cloning parents of sle-15-SP4-Server-DVD-HA-Incidents-x86_64-Build:35759:python-pyzmq-qam_3nodes_node03@64bit
5 jobs have been created:
 - sle-15-SP4-Server-DVD-HA-Incidents-x86_64-Build:35759:python-pyzmq-qam_3nodes_supportserver@64bit -> https://openqa.suse.de/tests/15528019
 - sle-15-SP4-Server-DVD-HA-Incidents-x86_64-Build:35759:python-pyzmq-qam_3nodes_client@64bit -> https://openqa.suse.de/tests/15528018
 - sle-15-SP4-Server-DVD-HA-Incidents-x86_64-Build:35759:python-pyzmq-qam_3nodes_node01@64bit -> https://openqa.suse.de/tests/15528020
 - sle-15-SP4-Server-DVD-HA-Incidents-x86_64-Build:35759:python-pyzmq-qam_3nodes_node02@64bit -> https://openqa.suse.de/tests/15528021
 - sle-15-SP4-Server-DVD-HA-Incidents-x86_64-Build:35759:python-pyzmq-qam_3nodes_node03@64bit -> https://openqa.suse.de/tests/15528017

If it works I'd add the worker back to production.


I double-checked that our salt states restart firewalld.service when the firewall config changes, and that is indeed the case. I also checked the salt documentation and this really is supposed to work as we think. So I have no idea why the runtime config of the firewall was incorrect.

Actions #10

Updated by mkittler about 2 months ago

All tests passed so I'll move the worker back to production tomorrow.

Actions #11

Updated by mkittler about 2 months ago

Tried to add the worker back to production. I applied salt states again explicitly and all states were applied successfully but this caused the firewall config to go into the broken state again.
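
For reference, reapplying the states to just this worker amounts to something like the following, run on the salt master (the minion ID is the one that appears in the summaries below):

sudo salt 'worker37.oqa.prg2.suse.org' state.apply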

Actions #12

Updated by mkittler about 2 months ago

I think the problem was that the zone config /etc/firewalld/zones/public.xml existed on the machine and contained a conflicting interface entry for the bridge device, so which zone the bridge ended up in was effectively random. I deleted the problematic file and re-ran salt. It still looks good, so salt didn't re-create the file. I'll create an SR for salt to make sure the file is deleted.
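
A quick way to spot such conflicting assignments - the same interface listed in more than one zone file makes the effective zone assignment unpredictable - is for example:

# list all interface entries across the permanent zone configs
sudo grep -H 'interface name=' /etc/firewalld/zones/*.xml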

Actions #13

Updated by mkittler about 2 months ago

  • Status changed from In Progress to Feedback

SR: https://gitlab.suse.de/openqa/salt-states-openqa/-/merge_requests/1274

I also triggered another reboot of worker37 to see whether it still works - but more for completeness because the persistent config wasn't the problem here.

Actions #14

Updated by livdywan about 2 months ago · Edited

I guess this is the ticket for worker37 now, which is causing a pipeline in salt-states-openqa to fail despite the parent ticket saying it's not in production:

          ID: wicked ifup all
    Function: cmd.run
      Result: False
     Comment: Command "wicked ifup all" run
     Started: 11:03:57.277556
    Duration: 30188.097 ms
     Changes:   
              ----------
              pid:
                  64970
              retcode:
                  2
              stderr:
                  wicked: Interface wait time (30s) reached
              stdout:
                  lo              up
                  eth0            up
                  br1             up
                  tap1            enslaved
                  tap2            enslaved
[...]
                  tap107          no-device
                  tap171          no-device
                  tap72           device-not-running
                  tap73           device-not-running
[...]
                  tap68           device-not-running
Summary for worker36.oqa.prg2.suse.org
Actions #15

Updated by mkittler about 2 months ago · Edited

Ok, but the end of the summary says worker36 so it is not related to this ticket. In fact, worker37 didn't cause this pipeline to fail:

Summary for worker37.oqa.prg2.suse.org
--------------
Succeeded: 586 (changed=3)
Failed:      0
--------------
Total states run:     586
Total run time:    42.232 s

The fact that #166802 makes it sound like worker37 is not in production is not really a contradiction; I think this ticket is a split-out sub-task of #166802.

Actions #16

Updated by mkittler about 2 months ago

I looked into the removal of the "trusted" zone in tap config files. Note that this is not related to this ticket as it is completely independent of the misconfiguration on worker37 and concerns all workers. (But @okurz asked to look into the issue in the daily today.)

The issue is that we get output like the following when applying salt states:

----------
          ID: /etc/sysconfig/network/ifcfg-tap47
    Function: file.managed
      Result: True
     Comment: File /etc/sysconfig/network/ifcfg-tap47 updated
     Started: 12:52:02.905576
    Duration: 23.445 ms
     Changes:   
              ----------
              diff:
                  --- 
                  +++ 
                  @@ -6,4 +6,3 @@
                   TUNNEL='tap'
                   TUNNEL_SET_GROUP='kvm'
                   TUNNEL_SET_OWNER='_openqa-worker'
                  -ZONE=trusted
              mode:
                  0644
----------
          ID: /etc/sysconfig/network/ifcfg-tap111
    Function: file.managed
      Result: True
     Comment: File /etc/sysconfig/network/ifcfg-tap111 updated
     Started: 12:52:02.931142
    Duration: 23.526 ms
     Changes:   
              ----------
              diff:
                  --- 
                  +++ 
                  @@ -6,4 +6,3 @@
                   TUNNEL='tap'
                   TUNNEL_SET_GROUP='kvm'
                   TUNNEL_SET_OWNER='_openqa-worker'
                  -ZONE=trusted
              mode:
                  0644
…
Summary for worker36.oqa.prg2.suse.org
--------------
Succeeded: 614 (changed=4)
Failed:      0
--------------
Total states run:     614
Total run time:    32.359 s

This is an example from worker36, but I also saw it on worker37 and others. It is also not causing any problems because I don't think we need this explicitly configured; at least my test jobs also worked without it. However, wicked seems to re-add the setting automatically and then we get this diff again when applying salt states.

Note that the behavior of wicked is easily reproducible:

martchus@worker36:~> cat /etc/sysconfig/network/ifcfg-tap47
BOOTPROTO='none'
IPADDR=''
NETMASK=''
PREFIXLEN=''
STARTMODE='hotplug'
TUNNEL='tap'
TUNNEL_SET_GROUP='kvm'
TUNNEL_SET_OWNER='_openqa-worker'
martchus@worker36:~> sudo wicked ifreload tap47
br1             up
tap47           enslaved
martchus@worker36:~> cat /etc/sysconfig/network/ifcfg-tap47
cat: /etc/sysconfig/network/ifcfg-tap47: Permission denied
martchus@worker36:~> l /etc/sysconfig/network/ifcfg-tap47
-rw------- 1 root root 154 Sep 27 11:02 /etc/sysconfig/network/ifcfg-tap47
martchus@worker36:~> sudo cat /etc/sysconfig/network/ifcfg-tap47
BOOTPROTO='none'
IPADDR=''
NETMASK=''
PREFIXLEN=''
STARTMODE='hotplug'
TUNNEL='tap'
TUNNEL_SET_GROUP='kvm'
TUNNEL_SET_OWNER='_openqa-worker'
ZONE=trusted

It apparently also changed the permissions.

I don't think we can do much about wicked's behavior, so I'd simply change our salt states to be in line with it.
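
As an illustration of the scope of such a change, one could list which tap config files currently lack the line wicked keeps re-adding (assuming the file naming shown above; grep -L prints files without a match):

sudo grep -L '^ZONE=trusted$' /etc/sysconfig/network/ifcfg-tap*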

Actions #17

Updated by mkittler about 2 months ago

I also saw the following change on worker36:

----------
          ID: firewalld_zones
    Function: file.managed
        Name: /etc/firewalld/zones/trusted.xml
      Result: True
     Comment: File /etc/firewalld/zones/trusted.xml updated
     Started: 12:51:53.597697
    Duration: 82.615 ms
     Changes:   
              ----------
              diff:
                  --- 
                  +++ 
                  @@ -2,11 +2,9 @@
                   <zone target="ACCEPT">
                     <short>Trusted</short>
                     <description>All network connections are accepted.</description>
                  -  <masquerade/>
                     <interface name="br1"/>
                     <interface name="ovs-system"/>
                     <interface name="eth0"/>
                  -  <interface name="tap47"/>
                  -  <interface name="tap111"/>
                  +  <masquerade/>
                     <forward/>
                   </zone>

Not sure why these tap devices ended up in the persistent firewall config. Here I think it makes sense to have salt clean up the XML. I don't know how the file could have been modified to contain tap devices - unless somebody really played around with firewall-cmd using the permanent flag.
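
For illustration only: this is roughly the kind of command that would leave such an entry behind in /etc/firewalld/zones/trusted.xml, and how it could be cleaned up again (tap47 is just the example device from the diff above):

# a permanent assignment like this ends up in the zone XML file
sudo firewall-cmd --permanent --zone=trusted --add-interface=tap47
# corresponding cleanup, then reload the permanent config into the runtime
sudo firewall-cmd --permanent --zone=trusted --remove-interface=tap47
sudo firewall-cmd --reload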

Actions #18

Updated by mkittler about 2 months ago

  • Status changed from Feedback to Resolved

MR: https://gitlab.suse.de/openqa/salt-states-openqa/-/merge_requests/1277

However, this is not part of the ticket and also not really important, so I'm resolving the ticket regardless. We can discuss the MR further on GitLab; if it is not something we can merge quickly, I suggest that @okurz as product owner decides whether this aspect is worth improving and, if so, creates a new ticket.
