action #167081
closed - test fails in support_server/setup on osd worker37 size:S
Added by acarvajal 2 months ago. Updated about 2 months ago.
Description
Observation
openQA test in scenario sle-12-SP5-Server-DVD-HA-Incidents-x86_64-qam_ha_hawk_supportserver@64bit fails in setup.
Connections to 10.0.2.2 fail.
So far, failures were only observed on worker37.
List of failed jobs:
https://openqa.suse.de/tests/15475749#step/setup/78
https://openqa.suse.de/tests/15468640#step/setup/73
https://openqa.suse.de/tests/15468646#step/setup/86
https://openqa.suse.de/tests/15468610#step/setup/86
https://openqa.suse.de/tests/15478397
https://openqa.suse.de/tests/15478325
https://openqa.suse.de/tests/15478676
Reproducible
Fails since (at least) Build :35702:grep (current job)
Expected result
Last good: :35691:xerces-c (or more recent)
Suggestions
- See what the status of the ovs bridge is on that worker (see the command sketch after this list)
- Bring worker37 back into production after verification
- Understand why we frickin' still have openQA multi-machine jobs failing without us being alerted if prerequisites on the machines are not fulfilled
- It should be safe to assume that the services (openvswitch and os-autoinst-openvswitch) work as per #162284
- Have a look at https://open.qa/docs/#_debugging_open_vswitch_configuration and related sections
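A quick way to inspect the bridge and the related services on the worker could look like this (just a sketch; br1 as the OVS bridge name is an assumption based on the usual multi-machine worker setup):
# assumption: br1 is the OVS bridge used for multi-machine jobs on this worker
sudo ovs-vsctl show                                                          # bridges, ports and GRE tunnels
sudo systemctl status openvswitch.service os-autoinst-openvswitch.service
sudo firewall-cmd --get-active-zones                                         # which zone br1 and the tap devices ended up in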
Further details
Always latest result in this scenario: latest
Updated by nicksinger 2 months ago
- Related to action #166802: Recover worker37, worker38, worker39 size:S added
Updated by nicksinger 2 months ago
I've removed the worker from production and added a comment here: https://progress.opensuse.org/issues/166802#note-11
Updated by dheidler about 2 months ago
- Subject changed from test fails in support_server/setup to test fails in support_server/setup on osd worker37 size:S
- Description updated (diff)
- Status changed from New to Workable
Updated by mkittler about 2 months ago
- Status changed from Workable to In Progress
- Assignee set to mkittler
Updated by mkittler about 2 months ago
It looks like the runtime firewall config was wrong because the interfaces were not in the correct zone. I fixed that now. I'm wondering why that is, though. The permanent config looks good.
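For reference, a minimal sketch of how the runtime zone assignment can be checked and fixed (assuming br1 belongs into the trusted zone as on the other workers):
# show which zone the bridge is currently assigned to (runtime)
sudo firewall-cmd --get-zone-of-interface=br1
# move it into the trusted zone for the running config only (no --permanent)
sudo firewall-cmd --zone=trusted --change-interface=br1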
Updated by openqa_review about 2 months ago
- Due date set to 2024-10-09
Setting due date based on mean cycle time of SUSE QE Tools
Updated by mkittler about 2 months ago
I cloned some test jobs and they succeeded:
openqa-clone-job --skip-chained-deps --within-instance https://openqa.suse.de/tests/15493582 _GROUP=0 {TEST,BUILD}+=-poo167081 WORKER_CLASS=worker37
Cloning parents of sle-15-SP7-Online-x86_64-Build21.2-ping_server@64bit
Cloning children of sle-15-SP7-Online-x86_64-Build21.2-ping_server@64bit
Cloning parents of sle-15-SP7-Online-x86_64-Build21.2-ping_client@64bit
2 jobs have been created:
- sle-15-SP7-Online-x86_64-Build21.2-ping_server@64bit -> https://openqa.suse.de/tests/15528008
- sle-15-SP7-Online-x86_64-Build21.2-ping_client@64bit -> https://openqa.suse.de/tests/15528009
I'll reboot the worker to see whether the firewall config is still correct then (even though the persistent config looked good anyway).
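A simple way to compare the runtime and the permanent config after the reboot (sketch, assuming the trusted zone is the relevant one):
# runtime view of the trusted zone
sudo firewall-cmd --zone=trusted --list-all
# permanent view; both should match right after a reboot
sudo firewall-cmd --permanent --zone=trusted --list-all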
Updated by mkittler about 2 months ago
It still works after a reboot. So I suppose nothing was misconfigured persistently and probably a reboot alone would have helped, too.
I also cloned one of the more complicated scenarios mentioned in the ticket description:
openqa-clone-job --skip-chained-deps --within-instance https://openqa.suse.de/tests/15499110 _GROUP=0 {TEST,BUILD}+=-poo167081 WORKER_CLASS=worker37
Cloning parents of sle-15-SP4-Server-DVD-HA-Incidents-x86_64-Build:35759:python-pyzmq-qam_3nodes_supportserver@64bit
Cloning children of sle-15-SP4-Server-DVD-HA-Incidents-x86_64-Build:35759:python-pyzmq-qam_3nodes_supportserver@64bit
Cloning parents of sle-15-SP4-Server-DVD-HA-Incidents-x86_64-Build:35759:python-pyzmq-qam_3nodes_client@64bit
Cloning parents of sle-15-SP4-Server-DVD-HA-Incidents-x86_64-Build:35759:python-pyzmq-qam_3nodes_node01@64bit
Cloning parents of sle-15-SP4-Server-DVD-HA-Incidents-x86_64-Build:35759:python-pyzmq-qam_3nodes_node02@64bit
Cloning parents of sle-15-SP4-Server-DVD-HA-Incidents-x86_64-Build:35759:python-pyzmq-qam_3nodes_node03@64bit
5 jobs have been created:
- sle-15-SP4-Server-DVD-HA-Incidents-x86_64-Build:35759:python-pyzmq-qam_3nodes_supportserver@64bit -> https://openqa.suse.de/tests/15528019
- sle-15-SP4-Server-DVD-HA-Incidents-x86_64-Build:35759:python-pyzmq-qam_3nodes_client@64bit -> https://openqa.suse.de/tests/15528018
- sle-15-SP4-Server-DVD-HA-Incidents-x86_64-Build:35759:python-pyzmq-qam_3nodes_node01@64bit -> https://openqa.suse.de/tests/15528020
- sle-15-SP4-Server-DVD-HA-Incidents-x86_64-Build:35759:python-pyzmq-qam_3nodes_node02@64bit -> https://openqa.suse.de/tests/15528021
- sle-15-SP4-Server-DVD-HA-Incidents-x86_64-Build:35759:python-pyzmq-qam_3nodes_node03@64bit -> https://openqa.suse.de/tests/15528017
If it works I'd add the worker back to production.
I double checked that our salt states restart firewalld.service in case the firewall config changes, and that is indeed the case. I also checked the salt documentation and this is really supposed to work as we think. So I have no idea why the runtime config of the firewall was incorrect.
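One way to verify on the worker that firewalld was actually restarted by the salt run would be to compare timestamps (just a sketch):
# when did firewalld last enter the active state?
systemctl show -p ActiveEnterTimestamp firewalld.service
# compare with the journal around the time the salt states were applied
sudo journalctl -u firewalld.service --since "-2h"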
Updated by mkittler about 2 months ago
All tests passed so I'll move the worker back to production tomorrow.
Updated by mkittler about 2 months ago
Tried to add the worker back to production. I applied the salt states again explicitly and all states were applied successfully, but this caused the firewall config to go into the broken state again.
Updated by mkittler about 2 months ago
I think the problem was that the zone config /etc/firewalld/zones/public.xml existed on the machine and contained a conflicting interface entry for the bridge device. So what zone the bridge ended up in was random. I deleted the problematic file and re-ran salt. It still looks good, so salt didn't create the file again. I'll create an SR for salt to make sure the file is deleted.
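Sketch of the manual cleanup (assuming public.xml was the only stale zone file claiming the bridge):
# find permanent zone files that claim br1
grep -l 'interface name="br1"' /etc/firewalld/zones/*.xml
# drop the stale file and reload so the runtime config matches again
sudo rm /etc/firewalld/zones/public.xml
sudo firewall-cmd --reload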
Updated by mkittler about 2 months ago
- Status changed from In Progress to Feedback
SR: https://gitlab.suse.de/openqa/salt-states-openqa/-/merge_requests/1274
I also triggered another reboot of worker37 to see whether it still works - but more for completeness because the persistent config wasn't the problem here.
Updated by livdywan about 2 months ago · Edited
I guess this is the ticket for worker37 now, which is causing a pipeline in salt-states-openqa to fail despite the parent ticket saying it's not in production:
ID: wicked ifup all
Function: cmd.run
Result: False
Comment: Command "wicked ifup all" run
Started: 11:03:57.277556
Duration: 30188.097 ms
Changes:
----------
pid:
64970
retcode:
2
stderr:
wicked: Interface wait time (30s) reached
stdout:
lo up
eth0 up
br1 up
tap1 enslaved
tap2 enslaved
[...]
tap107 no-device
tap171 no-device
tap72 device-not-running
tap73 device-not-running
[...]
tap68 device-not-running
Summary for worker36.oqa.prg2.suse.org
Updated by mkittler about 2 months ago · Edited
Ok, but the end of the summary says worker36, so it is not related to this ticket. In fact, worker37 didn't cause this pipeline to fail:
Summary for worker37.oqa.prg2.suse.org
--------------
Succeeded: 586 (changed=3)
Failed: 0
--------------
Total states run: 586
Total run time: 42.232 s
The fact that #166802 makes it sound like worker37 is not in production yet is not really a contradiction. I think this ticket is a split-out sub-task of #166802.
Updated by mkittler about 2 months ago
I looked into the removal of the "trusted" zone in tap config files. Note that this is not related to this ticket as it is completely independent of the misconfiguration on worker37 and concerns all workers. (But @okurz asked me to look into the issue in the daily today.)
The issue is that we get output like the following when applying salt states:
----------
ID: /etc/sysconfig/network/ifcfg-tap47
Function: file.managed
Result: True
Comment: File /etc/sysconfig/network/ifcfg-tap47 updated
Started: 12:52:02.905576
Duration: 23.445 ms
Changes:
----------
diff:
---
+++
@@ -6,4 +6,3 @@
TUNNEL='tap'
TUNNEL_SET_GROUP='kvm'
TUNNEL_SET_OWNER='_openqa-worker'
-ZONE=trusted
mode:
0644
----------
ID: /etc/sysconfig/network/ifcfg-tap111
Function: file.managed
Result: True
Comment: File /etc/sysconfig/network/ifcfg-tap111 updated
Started: 12:52:02.931142
Duration: 23.526 ms
Changes:
----------
diff:
---
+++
@@ -6,4 +6,3 @@
TUNNEL='tap'
TUNNEL_SET_GROUP='kvm'
TUNNEL_SET_OWNER='_openqa-worker'
-ZONE=trusted
mode:
0644
…
Summary for worker36.oqa.prg2.suse.org
--------------
Succeeded: 614 (changed=4)
Failed: 0
--------------
Total states run: 614
Total run time: 32.359 s
This is an example from worker36 but I also saw it on worker37 and others. It is also not causing any problems because I don't think we need this explicitly configured; at least my test jobs also worked without it. However, wicked seems to re-add the setting automatically and then we get this diff again when applying salt states.
Note that the behavior of wicked is easily reproducible:
martchus@worker36:~> cat /etc/sysconfig/network/ifcfg-tap47
BOOTPROTO='none'
IPADDR=''
NETMASK=''
PREFIXLEN=''
STARTMODE='hotplug'
TUNNEL='tap'
TUNNEL_SET_GROUP='kvm'
TUNNEL_SET_OWNER='_openqa-worker'
martchus@worker36:~> sudo wicked ifreload tap47
br1 up
tap47 enslaved
martchus@worker36:~> cat /etc/sysconfig/network/ifcfg-tap47
cat: /etc/sysconfig/network/ifcfg-tap47: Permission denied
martchus@worker36:~> l /etc/sysconfig/network/ifcfg-tap47
-rw------- 1 root root 154 Sep 27 11:02 /etc/sysconfig/network/ifcfg-tap47
martchus@worker36:~> sudo cat /etc/sysconfig/network/ifcfg-tap47
BOOTPROTO='none'
IPADDR=''
NETMASK=''
PREFIXLEN=''
STARTMODE='hotplug'
TUNNEL='tap'
TUNNEL_SET_GROUP='kvm'
TUNNEL_SET_OWNER='_openqa-worker'
ZONE=trusted
It apparently also changed the permissions.
I don't think we can do much about wicked's behavior, so I'd simply change our salt states to be in line with that.
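For reference, listing the tap config files that wicked has already updated this way is straightforward (sketch; the glob assumes the usual ifcfg-tap* naming):
# tap interfaces whose ifcfg file currently carries the ZONE=trusted line
# (sudo is needed because wicked also tightened the file permissions)
sudo grep -l '^ZONE=trusted' /etc/sysconfig/network/ifcfg-tap*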
Updated by mkittler about 2 months ago
I also saw the following change on worker36:
----------
ID: firewalld_zones
Function: file.managed
Name: /etc/firewalld/zones/trusted.xml
Result: True
Comment: File /etc/firewalld/zones/trusted.xml updated
Started: 12:51:53.597697
Duration: 82.615 ms
Changes:
----------
diff:
---
+++
@@ -2,11 +2,9 @@
<zone target="ACCEPT">
<short>Trusted</short>
<description>All network connections are accepted.</description>
- <masquerade/>
<interface name="br1"/>
<interface name="ovs-system"/>
<interface name="eth0"/>
- <interface name="tap47"/>
- <interface name="tap111"/>
+ <masquerade/>
<forward/>
</zone>
Not sure why these tap devices ended up in the persistent firewall config. Here I think it makes sense to have salt clean up the XML. I would not know how this was modified to contain tap devices - unless somebody really played around with firewall-cmd using the --permanent flag.
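If needed, the stray entries could also be dropped manually instead of via salt (sketch; interface names taken from the diff above):
# remove the tap devices from the permanent trusted zone config
sudo firewall-cmd --permanent --zone=trusted --remove-interface=tap47
sudo firewall-cmd --permanent --zone=trusted --remove-interface=tap111
# reload so the runtime and the permanent config stay consistent
sudo firewall-cmd --reload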
Updated by mkittler about 2 months ago
- Status changed from Feedback to Resolved
MR: https://gitlab.suse.de/openqa/salt-states-openqa/-/merge_requests/1277
However, this is not part of the ticket and also not really important, so I'm resolving the ticket regardless. We can discuss the MR further on GitLab, and if it is not something we can merge quickly I suggest that @okurz as product owner decides whether it is worth improving this aspect and, if yes, creates a new ticket.