action #167081

test fails in support_server/setup on osd worker37 size:S

Added by acarvajal 2 months ago. Updated about 2 months ago.

Status: Resolved
Priority: Normal
Assignee: mkittler
Category: Regressions/Crashes
Target version: Ready
Start date: 2024-09-19
Due date: 2024-10-09
% Done: 0%
Estimated time:
Description

Observation

openQA test in scenario sle-12-SP5-Server-DVD-HA-Incidents-x86_64-qam_ha_hawk_supportserver@64bit fails in setup.
Connections to 10.0.2.2 fail.

So far, failures have only been observed on worker37.

List of failed jobs:
https://openqa.suse.de/tests/15475749#step/setup/78
https://openqa.suse.de/tests/15468640#step/setup/73
https://openqa.suse.de/tests/15468646#step/setup/86
https://openqa.suse.de/tests/15468610#step/setup/86
https://openqa.suse.de/tests/15478397
https://openqa.suse.de/tests/15478325
https://openqa.suse.de/tests/15478676

Reproducible

Fails since (at least) Build :35702:grep (current job)

Expected result

Last good: :35691:xerces-c (or more recent)

Suggestions

  • Check what the status of the ovs bridge is on that worker (see the command sketch after this list)
  • Bring worker37 back into production after verification
  • Understand why openQA multi-machine jobs can still fail without us being alerted when prerequisites on the machines are not fulfilled
  • It should be safe to assume that the services (openvswitch and os-autoinst-openvswitch) work as per #162284
  • Have a look at https://open.qa/docs/#_debugging_open_vswitch_configuration and related sections
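
As a starting point for the first suggestion, a rough sketch of how the bridge and the involved services could be inspected on the worker (br1 is the bridge name that shows up later in this ticket):

# show the Open vSwitch topology (bridges, ports, tunnels)
sudo ovs-vsctl show
# check that the relevant services are running
sudo systemctl status openvswitch.service os-autoinst-openvswitch.service
# check which firewalld zone the bridge is currently assigned to
sudo firewall-cmd --get-zone-of-interface=br1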

Further details

Always latest result in this scenario: latest


Related issues: 1 (1 open, 0 closed)

Related to openQA Infrastructure - action #166802: Recover worker37, worker38, worker39 size:S (Blocked, assigned to okurz)

Actions #1

Updated by nicksinger 2 months ago

  • Related to action #166802: Recover worker37, worker38, worker39 size:S added
Actions #2

Updated by nicksinger 2 months ago

I've removed the worker from production and added a comment here: https://progress.opensuse.org/issues/166802#note-11

Actions #3

Updated by okurz 2 months ago

  • Category set to Regressions/Crashes
  • Target version set to Ready
Actions #4

Updated by dheidler about 2 months ago

  • Subject changed from test fails in support_server/setup to test fails in support_server/setup on osd worker37 size:S
  • Description updated (diff)
  • Status changed from New to Workable
Actions #5

Updated by mkittler about 2 months ago

  • Status changed from Workable to In Progress
  • Assignee set to mkittler
Actions #6

Updated by mkittler about 2 months ago

It looks like the runtime firewall config was wrong because the interfaces were not in the correct zone. I fixed that now. I'm wondering why that happened, though, since the permanent config looks good.
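
For reference, a minimal sketch of the kind of runtime-only adjustment this amounts to (assuming br1 is the interface that had to be moved into the trusted zone; without --permanent only the runtime configuration is changed):

# show which interfaces currently sit in which zone (runtime view)
sudo firewall-cmd --get-active-zones
# move the bridge into the trusted zone for the running config only
sudo firewall-cmd --zone=trusted --change-interface=br1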

Actions #7

Updated by openqa_review about 2 months ago

  • Due date set to 2024-10-09

Setting due date based on mean cycle time of SUSE QE Tools

Actions #8

Updated by mkittler about 2 months ago

I cloned some test jobs and they succeeded:

openqa-clone-job --skip-chained-deps --within-instance https://openqa.suse.de/tests/15493582 _GROUP=0 {TEST,BUILD}+=-poo167081 WORKER_CLASS=worker37
Cloning parents of sle-15-SP7-Online-x86_64-Build21.2-ping_server@64bit
Cloning children of sle-15-SP7-Online-x86_64-Build21.2-ping_server@64bit
Cloning parents of sle-15-SP7-Online-x86_64-Build21.2-ping_client@64bit
2 jobs have been created:
 - sle-15-SP7-Online-x86_64-Build21.2-ping_server@64bit -> https://openqa.suse.de/tests/15528008
 - sle-15-SP7-Online-x86_64-Build21.2-ping_client@64bit -> https://openqa.suse.de/tests/15528009

I'll reboot the worker to see whether the firewall config is still correct then (even though the persistent config looked good anyway).
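
One way to check after the reboot whether the runtime and permanent views agree (br1 again being the bridge on this worker):

# runtime view: which zone the bridge actually ended up in
sudo firewall-cmd --get-zone-of-interface=br1
# permanent view: which interfaces the trusted zone is supposed to contain
sudo firewall-cmd --permanent --zone=trusted --list-interfaces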

Actions #9

Updated by mkittler about 2 months ago

It still works after a reboot. So I suppose nothing was misconfigured persistently and probably a reboot alone would have helped, too.

I also cloned one of the more complicated scenarios mentioned in the ticket description:

openqa-clone-job --skip-chained-deps --within-instance https://openqa.suse.de/tests/15499110 _GROUP=0 {TEST,BUILD}+=-poo167081 WORKER_CLASS=worker37
Cloning parents of sle-15-SP4-Server-DVD-HA-Incidents-x86_64-Build:35759:python-pyzmq-qam_3nodes_supportserver@64bit
Cloning children of sle-15-SP4-Server-DVD-HA-Incidents-x86_64-Build:35759:python-pyzmq-qam_3nodes_supportserver@64bit
Cloning parents of sle-15-SP4-Server-DVD-HA-Incidents-x86_64-Build:35759:python-pyzmq-qam_3nodes_client@64bit
Cloning parents of sle-15-SP4-Server-DVD-HA-Incidents-x86_64-Build:35759:python-pyzmq-qam_3nodes_node01@64bit
Cloning parents of sle-15-SP4-Server-DVD-HA-Incidents-x86_64-Build:35759:python-pyzmq-qam_3nodes_node02@64bit
Cloning parents of sle-15-SP4-Server-DVD-HA-Incidents-x86_64-Build:35759:python-pyzmq-qam_3nodes_node03@64bit
5 jobs have been created:
 - sle-15-SP4-Server-DVD-HA-Incidents-x86_64-Build:35759:python-pyzmq-qam_3nodes_supportserver@64bit -> https://openqa.suse.de/tests/15528019
 - sle-15-SP4-Server-DVD-HA-Incidents-x86_64-Build:35759:python-pyzmq-qam_3nodes_client@64bit -> https://openqa.suse.de/tests/15528018
 - sle-15-SP4-Server-DVD-HA-Incidents-x86_64-Build:35759:python-pyzmq-qam_3nodes_node01@64bit -> https://openqa.suse.de/tests/15528020
 - sle-15-SP4-Server-DVD-HA-Incidents-x86_64-Build:35759:python-pyzmq-qam_3nodes_node02@64bit -> https://openqa.suse.de/tests/15528021
 - sle-15-SP4-Server-DVD-HA-Incidents-x86_64-Build:35759:python-pyzmq-qam_3nodes_node03@64bit -> https://openqa.suse.de/tests/15528017

If it works I'd add the worker back to production.


I double-checked that our salt states restart firewalld.service when the firewall config changes, and that is indeed the case. I also checked the salt documentation and this really is supposed to work as we think. So I have no idea why the runtime config of the firewall was incorrect.

Actions #10

Updated by mkittler about 2 months ago

All tests passed so I'll move the worker back to production tomorrow.

Actions #11

Updated by mkittler about 2 months ago

Tried to add the worker back to production. I applied salt states again explicitly and all states were applied successfully but this caused the firewall config to go into the broken state again.
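
For reference, reapplying the states to just this worker amounts to something like the following, run on the salt master (the minion ID is the one that appears in the summaries below):

sudo salt 'worker37.oqa.prg2.suse.org' state.apply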

Actions #12

Updated by mkittler about 2 months ago

I think the problem was that the zone config /etc/firewalld/zones/public.xml existed on the machine and contained a conflicting interface entry for the bridge device, so which zone the bridge ended up in was effectively random. I deleted the problematic file and re-ran salt. It still looks good, so salt didn't re-create the file. I'll create an SR for salt to make sure the file is deleted.
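
A quick way to spot such conflicting assignments - the same interface listed in more than one zone file makes the effective zone assignment unpredictable - is for example:

# list all interface entries across the permanent zone configs
sudo grep -H 'interface name=' /etc/firewalld/zones/*.xml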

Actions #13

Updated by mkittler about 2 months ago

  • Status changed from In Progress to Feedback

SR: https://gitlab.suse.de/openqa/salt-states-openqa/-/merge_requests/1274

I also triggered another reboot of worker37 to see whether it still works - but more for completeness because the persistent config wasn't the problem here.

Actions #14

Updated by livdywan about 2 months ago · Edited

I guess this is the ticket for worker37 now, which is causing a pipeline in salt-states-openqa to fail despite the parent ticket saying it's not in production:

          ID: wicked ifup all
    Function: cmd.run
      Result: False
     Comment: Command "wicked ifup all" run
     Started: 11:03:57.277556
    Duration: 30188.097 ms
     Changes:   
              ----------
              pid:
                  64970
              retcode:
                  2
              stderr:
                  wicked: Interface wait time (30s) reached
              stdout:
                  lo              up
                  eth0            up
                  br1             up
                  tap1            enslaved
                  tap2            enslaved
[...]
                  tap107          no-device
                  tap171          no-device
                  tap72           device-not-running
                  tap73           device-not-running
[...]
                  tap68           device-not-running
Summary for worker36.oqa.prg2.suse.org
Actions #15

Updated by mkittler about 2 months ago · Edited

Ok, but the end of the summary says worker36 so it is not related to this ticket. In fact, worker37 didn't cause this pipeline to fail:

Summary for worker37.oqa.prg2.suse.org
--------------
Succeeded: 586 (changed=3)
Failed:      0
--------------
Total states run:     586
Total run time:    42.232 s

The fact that #166802 makes it sound like worker37 is not in production is not really a contradiction; I think this ticket is a split-out sub-task of #166802.

Actions #16

Updated by mkittler about 2 months ago

I looked into the removal of the "trusted" zone in tap config files. Note that this is not related to this ticket as it is completely independent of the misconfiguration on worker37 and concerns all workers. (But @okurz asked to look into the issue in the daily today.)

The issue is that we get output like the following when applying salt states:

----------
          ID: /etc/sysconfig/network/ifcfg-tap47
    Function: file.managed
      Result: True
     Comment: File /etc/sysconfig/network/ifcfg-tap47 updated
     Started: 12:52:02.905576
    Duration: 23.445 ms
     Changes:   
              ----------
              diff:
                  --- 
                  +++ 
                  @@ -6,4 +6,3 @@
                   TUNNEL='tap'
                   TUNNEL_SET_GROUP='kvm'
                   TUNNEL_SET_OWNER='_openqa-worker'
                  -ZONE=trusted
              mode:
                  0644
----------
          ID: /etc/sysconfig/network/ifcfg-tap111
    Function: file.managed
      Result: True
     Comment: File /etc/sysconfig/network/ifcfg-tap111 updated
     Started: 12:52:02.931142
    Duration: 23.526 ms
     Changes:   
              ----------
              diff:
                  --- 
                  +++ 
                  @@ -6,4 +6,3 @@
                   TUNNEL='tap'
                   TUNNEL_SET_GROUP='kvm'
                   TUNNEL_SET_OWNER='_openqa-worker'
                  -ZONE=trusted
              mode:
                  0644
…
Summary for worker36.oqa.prg2.suse.org
--------------
Succeeded: 614 (changed=4)
Failed:      0
--------------
Total states run:     614
Total run time:    32.359 s

This is an example from worker36, but I also saw it on worker37 and others. It is also not causing any problems because I don't think we need this explicitly configured; at least my test jobs also worked without it. However, wicked seems to re-add the setting automatically and then we get this diff again when applying salt states.

Note that the behavior of wicked is easily reproducible:

martchus@worker36:~> cat /etc/sysconfig/network/ifcfg-tap47
BOOTPROTO='none'
IPADDR=''
NETMASK=''
PREFIXLEN=''
STARTMODE='hotplug'
TUNNEL='tap'
TUNNEL_SET_GROUP='kvm'
TUNNEL_SET_OWNER='_openqa-worker'
martchus@worker36:~> sudo wicked ifreload tap47
br1             up
tap47           enslaved
martchus@worker36:~> cat /etc/sysconfig/network/ifcfg-tap47
cat: /etc/sysconfig/network/ifcfg-tap47: Permission denied
martchus@worker36:~> l /etc/sysconfig/network/ifcfg-tap47
-rw------- 1 root root 154 Sep 27 11:02 /etc/sysconfig/network/ifcfg-tap47
martchus@worker36:~> sudo cat /etc/sysconfig/network/ifcfg-tap47
BOOTPROTO='none'
IPADDR=''
NETMASK=''
PREFIXLEN=''
STARTMODE='hotplug'
TUNNEL='tap'
TUNNEL_SET_GROUP='kvm'
TUNNEL_SET_OWNER='_openqa-worker'
ZONE=trusted

It apparently also changed the permissions.

I don't think we can do much about wicked's behavior, so I'd simply change our salt states to be in line with it.
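
As an illustration of the scope of such a change, one could list which tap config files currently lack the line wicked keeps re-adding (assuming the file naming shown above; grep -L prints files without a match):

sudo grep -L '^ZONE=trusted$' /etc/sysconfig/network/ifcfg-tap*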

Actions #17

Updated by mkittler about 2 months ago

I also saw the following change on worker36:

----------
          ID: firewalld_zones
    Function: file.managed
        Name: /etc/firewalld/zones/trusted.xml
      Result: True
     Comment: File /etc/firewalld/zones/trusted.xml updated
     Started: 12:51:53.597697
    Duration: 82.615 ms
     Changes:   
              ----------
              diff:
                  --- 
                  +++ 
                  @@ -2,11 +2,9 @@
                   <zone target="ACCEPT">
                     <short>Trusted</short>
                     <description>All network connections are accepted.</description>
                  -  <masquerade/>
                     <interface name="br1"/>
                     <interface name="ovs-system"/>
                     <interface name="eth0"/>
                  -  <interface name="tap47"/>
                  -  <interface name="tap111"/>
                  +  <masquerade/>
                     <forward/>
                   </zone>

Not sure why these tap devices ended up in the persistent firewall config. Here I think it makes sense to have salt clean up the XML. I don't know how the file could have been modified to contain tap devices - unless somebody really played around with firewall-cmd using the permanent flag.
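
For illustration only: this is roughly the kind of command that would leave such an entry behind in /etc/firewalld/zones/trusted.xml, and how it could be cleaned up again (tap47 is just the example device from the diff above):

# a permanent assignment like this ends up in the zone XML file
sudo firewall-cmd --permanent --zone=trusted --add-interface=tap47
# corresponding cleanup, then reload the permanent config into the runtime
sudo firewall-cmd --permanent --zone=trusted --remove-interface=tap47
sudo firewall-cmd --reload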

Actions #18

Updated by mkittler about 2 months ago

  • Status changed from Feedback to Resolved

MR: https://gitlab.suse.de/openqa/salt-states-openqa/-/merge_requests/1277

However, this is not part of the ticket and also not really important, so I'm resolving the ticket regardless. We can discuss the MR further on GitLab; if it is not something we can merge quickly, I suggest that @okurz as product owner decides whether this aspect is worth improving and, if so, creates a new ticket.
