action #157414
closed
Network broken with multimachine on multiple workers (broken packet forwarding / NAT) size:M
0%
Description
Observation
openQA test in scenario microos-Tumbleweed-DVD-aarch64-remote_ssh_controller@aarch64 fails in
await_install
possibly similar to what had happened in #150920 and #155278
Test suite description
Maintainer: jrivera Install remote server (parallel job) with ssh.
Reproducible
Fails since (at least) Build 20240314 (current job)
Expected result
Last good: 20240310 (or more recent)
Suggestions
- Extend the existing test code as suggested in #157414-8 to have more explicit error messages
- Lookup the history of tickets in #150920, #155278
- Consider extending our setup multimachine script and potentially call it periodically?
- Consider more explicit error checks in our worker code to prevent even running into such problems in openQA tests
Further details
Always latest result in this scenario: latest
Out of scope
- Find a persistent solution, see #159414
Updated by okurz 8 months ago
- Related to action #150920: openqaworker-arm22 is unable to join download.opensuse.org in parallel tests = tap mode size:M added
Updated by jbaier_cz 8 months ago
https://openqa.opensuse.org/tests/4018286#step/remote_controller/14 shows
[ 63.657904] named[3159]: timed out resolving './DNSKEY/IN': 10.150.1.11#53
[ 63.659073] named[3159]: network unreachable resolving './DNSKEY/IN': 2001:7fd::1#53
[ 63.660290] named[3159]: network unreachable resolving './DNSKEY/IN': 2001:500:12::d0d#53
[ 63.661531] named[3159]: network unreachable resolving './DNSKEY/IN': 2001:500:9f::42#53
Updated by jbaier_cz 8 months ago
- Related to action #155278: o3 aarch64 multi-machine tests on openqaworker-arm21 and 22 fail to resolve codecs.opensuse.org size:M added
Updated by jbaier_cz 8 months ago
- Status changed from In Progress to New
- Priority changed from Urgent to Normal
This is #150920 yet again.
It looks like #155278#note-8 did not make a difference
openqaworker-arm21:~ # cat /etc/sysconfig/network/ifcfg-eth0
BOOTPROTO='dhcp'
STARTMODE='auto'
ZONE=trusted
openqaworker-arm21:~ # firewall-cmd --list-all-zones
...
public (active)
target: default
icmp-block-inversion: no
interfaces: eth0
Maybe something else is messing with the firewall configuration.
I reapplied #150920#note-25
firewall-cmd --zone public --remove-interface=eth0
firewall-cmd --zone trusted --add-interface=eth0
That should lower the severity. I will try to investigate what was wrong and try another permanent solution
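The reapplied workaround could be wrapped roughly as follows. This is only a sketch: the function name and the FIREWALL_CMD override are made up here (the override exists so the logic can be dry-run with a stub), and it assumes nothing else, such as a ZONE= entry in ifcfg-eth0, moves the interface back afterwards.

```shell
# Sketch: put an interface into the trusted zone in the runtime and in the
# permanent firewalld configuration. FIREWALL_CMD is a hypothetical override
# (not a firewalld feature) that allows substituting a stub for dry runs.
ensure_trusted_zone() (
    iface=${1:-eth0}
    fw=${FIREWALL_CMD:-firewall-cmd}
    current=$($fw --get-zone-of-interface="$iface" 2>/dev/null)
    if [ "$current" = "trusted" ]; then
        echo "$iface already in trusted (runtime)"
    else
        echo "$iface is in '${current:-no zone}', moving to trusted (runtime)"
        $fw --zone=trusted --change-interface="$iface" >/dev/null || exit 1
    fi
    # The permanent configuration is stored separately from the runtime one,
    # so it is changed as well to survive a reload or reboot.
    $fw --permanent --zone=trusted --change-interface="$iface" >/dev/null
)
```

On a worker this would be called as ensure_trusted_zone eth0 with root privileges.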
Updated by jbaier_cz 8 months ago
- Tags changed from reactive work to reactive work, multi-machine
- Subject changed from Network broken with multimachine on openqaworker-arm22 to Network broken with multimachine on multiple workers (broken packet forwarding / NAT)
- Assignee deleted (jbaier_cz)
- Priority changed from Normal to High
So the issue is not limited to arm workers: https://openqa.opensuse.org/tests/4029113
We have the same symptoms and:
openqaworker23:~ # firewall-cmd --list-all-zones
...
public (active)
target: default
icmp-block-inversion: no
interfaces: eth0
openqaworker23:~ # cat /etc/sysconfig/network/ifcfg-eth0
BOOTPROTO='none'
STARTMODE='auto'
ZONE=public
I changed that to the trusted zone, but it did not fix the issue there. The problem is clearly visible in setup_multimachine: https://openqa.opensuse.org/tests/4029215#step/setup_multimachine/110 where the network manager already shows limited connectivity (maybe we should at least issue a warning in the test at this step).
On the other hand https://openqa.opensuse.org/tests/4029081 failed in the very same way on openqaworker25 where the interface is already in the correct zone. The last successful run https://openqa.opensuse.org/tests/4026044 was on openqaworker26. I tried to compare those workers but I am unable to find any difference in the network setup
So we might have an additional unknown problem here. I am raising the priority again and, as I will be on vacation, I am leaving this ticket free for someone to take over.
Updated by mkittler 8 months ago · Edited
We could lower the urgency by working around the problem via https://github.com/os-autoinst/openQA/pull/5536.
Looks like the latest run worked again and it ran across worker21 and 22. Runs from before across worker22 and 24 were passing as well. Do we know what set of workers is affected?
EDIT: I tried to figure out good vs. problematic workers via the job history:
good: 21, 22, 24
bad: 23, 26
unknown: 25
Although I'm not sure how much sense that makes considering 7 days ago even jobs across 23 and 26 passed (https://openqa.opensuse.org/tests/4025897). So this might not even be specific to certain hosts.
Updated by mkittler 8 months ago · Edited
- /etc/wicked/scripts/gre_tunnel_preup.sh looks good on all workers. (w20 seems non-existent; w27, w28, arm21 and arm22 are not connected to the gre network at all; this leaves w21 to w26 as valid workers in accordance with how the tap worker class is assigned)
- enough tap devices are configured on all hosts (30 slots, so the highest number would be tap158) and in the trusted zone according to firewall-cmd --get-active-zones - except on w22 where the public zone is also active and contains the tap interfaces
- cat /proc/sys/net/ipv4/ip_forward also returns 1 on all hosts
- for i in $hosts; do echo $i && ssh root@$i "firewall-cmd --query-forward --zone=trusted" ; done returns "no" on all hosts; I would have expected a "yes" but it looks like we just don't configure forwarding via firewalld here.
I changed the zoning on w22 via for i in {0..178}; do firewall-cmd --remove-interface=tap$i --zone=public ; done and for i in {0..178}; do firewall-cmd --add-interface=tap$i --zone=trusted ; done. Now firewall-cmd --get-active-zones only shows trusted (with all tap devices). Unfortunately I was distracted so there was a delay of 2 hours between those two commands. It doesn't look like this caused much disruption, though. I also ran the same commands with --permanent but got warnings like Warning: ALREADY_ENABLED: tap94 so this was likely not necessary.
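The check for tap interfaces outside the trusted zone can be scripted instead of read by eye. A minimal sketch, assuming the usual firewall-cmd --get-active-zones layout (an unindented zone name followed by an indented "interfaces:" line); the function name is made up for illustration.

```shell
# Sketch: read `firewall-cmd --get-active-zones` output on stdin and report
# every zone other than "trusted" that lists at least one tap interface.
tap_zones_outside_trusted() {
    awk '
        /^[^ \t]/ { zone = $1; next }   # unindented line starts a new zone
        $1 == "interfaces:" && zone != "trusted" {
            n = 0
            for (i = 2; i <= NF; i++) if ($i ~ /^tap[0-9]+$/) n++
            if (n > 0) print zone ": " n " tap interface(s)"
        }
    '
}
```

Usage would be firewall-cmd --get-active-zones | tap_zones_outside_trusted, which prints nothing when all tap devices are in the trusted zone.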
Updated by openqa_review 8 months ago
- Due date set to 2024-04-11
Setting due date based on mean cycle time of SUSE QE Tools
Updated by mkittler 8 months ago
The Next & Previous tab in the initial (arm) scenario looks very good. Recent jobs in the (x86_64) wireguard scenario (https://openqa.opensuse.org/tests/latest?arch=x86_64&distri=opensuse&flavor=DVD&machine=64bit&test=wireguard_client&version=Tumbleweed) are also passing. I'll keep observing the situation but would not work around the problem right now (by forcing jobs to run on a single worker) as it doesn't seem very problematic at the moment.
Updated by mkittler 8 months ago
On w22 the zone setting still reverted after I rebooted the machine just now. So I ran for i in {0..178}; do firewall-cmd --permanent --remove-interface=tap$i --zone=public ; firewall-cmd --permanent --add-interface=tap$i --zone=trusted ; done
and rebooted the machine. Now the zone setting is in fact persistent. (That means previously the "permanent" setting was wrong on w22.)
I also rebooted one of the other hosts to see whether that changes anything but that does not seem to be the case. So I assume only w22 was wrongly configured.
Note that w22 being problematic could even explain failures seen in clusters that only ran across other hosts due to our way of connecting each host with each other host.
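Since the runtime and the permanent configuration disagreed on w22, a quick comparison of both may help to spot such hosts before rebooting them. The following is a sketch only; the function name and the FIREWALL_CMD override are hypothetical and only serve to make the logic testable with a stub.

```shell
# Sketch: compare the runtime and the permanent zone of one interface.
# Prints a summary and returns non-zero if the two differ.
# FIREWALL_CMD is a hypothetical override, not something firewalld provides.
zone_drift() (
    iface=${1:-eth0}
    fw=${FIREWALL_CMD:-firewall-cmd}
    runtime=$($fw --get-zone-of-interface="$iface" 2>/dev/null)
    permanent=$($fw --permanent --get-zone-of-interface="$iface" 2>/dev/null)
    if [ "$runtime" = "$permanent" ]; then
        echo "$iface: consistent (${runtime:-unset})"
    else
        echo "$iface: runtime=${runtime:-unset} permanent=${permanent:-unset}"
        exit 1
    fi
)
```

For the tap devices this could be run in a loop, for example for i in 0 1 2; do zone_drift tap$i; done.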
Updated by mkittler 8 months ago
- Status changed from In Progress to Resolved
Both scenarios still look good. So I'm resolving this ticket now because I wouldn't know what else to improve. Probably the wrong setup on w22 was the culprit (besides the initial problem on arm workers which has already been fixed anyway).
Updated by okurz 8 months ago
- Status changed from Resolved to Feedback
From https://matrix.to/#/!dRljORKAiNJcGEDbYA:opensuse.org/$ofSOPVkvl0cltkJ28p2DkZ3oxVKuL9OBiEuLO4C5xyw
We have again issues with network and multi machine : https://openqa.opensuse.org/tests/4069622#step/networking/15
Could be #155278 as well.
Updated by jbaier_cz 8 months ago · Edited
- Priority changed from Urgent to High
# firewall-cmd --list-all-zones
...
public (active)
target: default
icmp-block-inversion: no
interfaces: eth0
sources:
services: dhcpv6-client ssh
ports:
protocols:
forward: no
masquerade: no
forward-ports:
source-ports:
icmp-blocks:
rich rules:
I executed firewall-cmd --permanent --zone=public --remove-interface=eth0; firewall-cmd --permanent --zone=trusted --add-interface=eth0
which should at least lower the severity again as that should fix the tests.
Updated by ggardet_arm 8 months ago
Updated by mkittler 8 months ago · Edited
I now also updated the runtime and permanent configuration (as it looks like only the permanent configuration was updated by @jbaier_cz). The persistent configuration looked good except that on arm21 the eth0 interface was in two zones at once which I fixed (leading to firewalld complaining about it: Error: INVALID_ZONE: public trusted (ERROR: interface 'eth0' is in 2 zone XML files, can be only in one)).
The config only appeared broken on the arm workers and on openqaworker23. Not sure why eth0 was moved again to the public zone there.
I restarted the failing job, let's see whether it worked: https://openqa.opensuse.org/tests/4071829#dependencies
EDIT: The tests are now passing again. So it was really just the firewalld settings again. Last time I only considered x86_64 hosts (as @jbaier_cz had already taken care of arm hosts before) and also checked only one other host's persistent config by rebooting. So probably I just missed updating the persistent config of some hosts.
So I've now just run for i in $hosts; do echo $i && ssh root@$i "firewall-cmd --permanent --zone=trusted --change-interface=eth0" ; done
to make sure the persistent config on all hosts is correct (and for i in $hosts; do echo $i && ssh root@$i "firewall-cmd --zone=trusted --change-interface=eth0" ; done
to see whether it actually worked).
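The INVALID_ZONE error seen on arm21 can also be detected up front by looking at the zone XML files directly. A rough sketch, assuming the permanent zone files live under /etc/firewalld/zones and bind interfaces via interface elements with a name attribute; the function name is made up here.

```shell
# Sketch: print the names of all zones whose XML file binds the given
# interface. More than one line of output corresponds to the
# "interface is in 2 zone XML files" situation.
zones_binding_interface() (
    iface=$1
    dir=${2:-/etc/firewalld/zones}
    grep -l "<interface name=\"$iface\"" "$dir"/*.xml 2>/dev/null |
        sed 's|.*/||; s|\.xml$||'
)
```

Running zones_binding_interface eth0 on a correctly configured worker should then print a single zone name.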
Updated by livdywan 8 months ago · Edited
- Subject changed from Network broken with multimachine on multiple workers (broken packet forwarding / NAT) size:M to Network broken with multimachine on multiple workers (broken packet forwarding / NAT) auto_review:"Test died.*curl -L openqa.opensuse.org.* failed at.*testapi":retry size:M
I restarted the failing job, let's see whether it worked: https://openqa.opensuse.org/tests/4071829#dependencies
Let's try and take advantage of autoreview so we can see if/when the settings are changed again
Updated by mkittler 8 months ago
- Subject changed from Network broken with multimachine on multiple workers (broken packet forwarding / NAT) auto_review:"Test died.*curl -L openqa.opensuse.org.* failed at.*testapi":retry size:M to Network broken with multimachine on multiple workers (broken packet forwarding / NAT) size:M
But not like this; this regex is probably way too generic considering we probably use curl and the o3 domain a lot.
Updated by mkittler 8 months ago
- Status changed from Feedback to Resolved
The arm and x86_64 scenarios are still looking good. As explained before we might have overlooked something but now the persistent and runtime config should be fine on all o3 tap x86_64 and arm worker hosts. So I'm considering this ticket resolved. If the interfaces appear again in the public zone at some point we will have to investigate what's changing the settings back but I wouldn't do that right now.
Updated by ggardet_arm 7 months ago
- Status changed from Resolved to Workable
- Priority changed from High to Urgent
Here we go again...
https://openqa.opensuse.org/tests/4098147#step/networking/15
Updated by okurz 7 months ago
for i in $hosts; do echo $i && ssh root@$i "firewall-cmd --get-zone-of-interface=eth0"; done
shows
openqaworker21
trusted
openqaworker22
trusted
openqaworker23
public
openqaworker24
trusted
openqaworker25
trusted
openqaworker26
trusted
openqaworker-arm21
public
openqaworker-arm22
public
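The listing above can be reduced to just the problematic hosts. A small sketch, assuming the loop output strictly alternates between a host line and a zone line (a failing ssh connection would break that assumption); the function name is made up for illustration.

```shell
# Sketch: read alternating "host" / "zone" lines on stdin and print the
# hosts whose eth0 is not in the trusted zone.
hosts_not_trusted() {
    awk 'NR % 2 == 1 { host = $0; next }
         $0 != "trusted" { print host " (" $0 ")" }'
}
```

Piping the loop output shown above into hosts_not_trusted would list openqaworker23, openqaworker-arm21 and openqaworker-arm22, each with zone public.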
Updated by okurz 7 months ago
- Copied to action #159414: Ensure that os-autoinst-setup-multi-machine reliably sets firewall zones not interfering with /etc/sysconfig/network/ifcfg-* size:S added
Updated by livdywan 7 months ago · Edited
for i in $hosts; do echo $i && sshpass -p $password ssh -o PubkeyAuthentication=no -o PreferredAuthentications=password root@$i "firewall-cmd --permanent --zone=trusted --change-interface=eth0"; done
for i in $hosts; do echo $i && sshpass -p $password ssh -o PubkeyAuthentication=no -o PreferredAuthentications=password root@$i "firewall-cmd --get-zone-of-interface=eth0"; done
So that should resolve the zone for now, see https://progress.opensuse.org/projects/openqav3/wiki/Wiki#Manual-command-execution-on-o3-workers for the $hosts
Updated by ggardet_arm 7 months ago
- Priority changed from High to Urgent
livdywan wrote in #note-36:
for i in $hosts; do echo $i && sshpass -p $password ssh root@$i "firewall-cmd --permanent --zone=trusted --change-interface=eth0"; done
for i in $hosts; do echo $i && sshpass -p $password ssh root@$i "firewall-cmd --get-zone-of-interface=eth0"; done
So that should resolve the zone for now, see https://progress.opensuse.org/projects/openqav3/wiki/Wiki#Manual-command-execution-on-o3-workers for the $hosts
Still failing, see:
https://openqa.opensuse.org/tests/4098745#step/ovs_server/19
Updated by livdywan 7 months ago
- Status changed from Workable to In Progress
- Assignee set to livdywan
ggardet_arm wrote in #note-39:
Still failing, see:
https://openqa.opensuse.org/tests/4098745#step/ovs_server/19
Okay. Maybe this is a different issue then. I'm taking another look.
Updated by ggardet_arm 7 months ago
livdywan wrote in #note-40:
ggardet_arm wrote in #note-39:
Still failing, see:
https://openqa.opensuse.org/tests/4098745#step/ovs_server/19
Okay. Maybe this is a different issue then. I'm taking another look.
Looks like openqaworker-arm21 vs openqaworker-arm22
Updated by livdywan 7 months ago · Edited
ggardet_arm wrote in #note-41:
livdywan wrote in #note-40:
ggardet_arm wrote in #note-39:
Still failing, see:
https://openqa.opensuse.org/tests/4098745#step/ovs_server/19
Okay. Maybe this is a different issue then. I'm taking another look.
Looks like openqaworker-arm21 vs openqaworker-arm22
Good catch! Apparently something once again reverted the zone on arm22 to public? Even though I'd just checked and fixed it... well, I fixed it once more. Let's see if this resolves the error. I'm keeping an eye on the worker as well this time - https://openqa.opensuse.org/tests/4099199#live is on arm22
Updated by mkittler 7 months ago · Edited
https://openqa.opensuse.org/tests/latest?arch=aarch64&distri=opensuse&flavor=DVD&machine=aarch64&test=ovs-server&version=Tumbleweed#next_previous and https://openqa.opensuse.org/tests/latest?arch=x86_64&distri=opensuse&flavor=DVD&machine=64bit&test=wireguard_client&version=Tumbleweed#next_previous still look good.
I also created https://github.com/os-autoinst/os-autoinst/pull/2491 to tackle the source of the configuration problem.