action #157414

Network broken with multimachine on multiple workers (broken packet forwarding / NAT) size:M

Added by ggardet_arm about 1 month ago. Updated 4 days ago.

Status: Feedback
Priority: High
Category: Bugs in existing tests
Start date: 2024-03-18
% Done: 0%

Description

Observation

openQA test in scenario microos-Tumbleweed-DVD-aarch64-remote_ssh_controller@aarch64 fails in
await_install
possibly similar to what happened in #150920 and #155278.

Test suite description

Maintainer: jrivera. Install remote server (parallel job) with ssh.

Reproducible

Fails since (at least) Build 20240314 (current job)

Expected result

Last good: 20240310 (or more recent)

Suggestions

  • Extend the existing test code as suggested in #157414-8 to produce more explicit error messages
  • Look up the ticket history in #150920 and #155278
  • Consider extending our setup multimachine script and potentially calling it periodically
  • Consider more explicit error checks in our worker code to avoid even running into such problems in openQA tests
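The last suggestion could look like a small periodic check on each worker. A minimal sketch, assuming the expected zone is "trusted" as discussed in this ticket; the helper name is made up for illustration:

```shell
#!/bin/sh
# Sketch: verify that an interface is in the expected firewalld zone.
# Expecting "trusted" is an assumption based on this ticket.

# Pure helper: compare an observed zone against the expected one.
zone_ok() {
    observed="$1"
    expected="${2:-trusted}"
    [ "$observed" = "$expected" ]
}

# On a worker this could run periodically, e.g. from a systemd timer:
#   zone=$(firewall-cmd --get-zone-of-interface=eth0)
#   zone_ok "$zone" || echo "WARNING: eth0 is in zone '$zone', expected 'trusted'" >&2
```

Such a check would have turned the silent misconfiguration into a visible warning long before a multi-machine job failed.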

Further details

Always latest result in this scenario: latest

Out of scope

  • Find a persistent solution, see #159414

Related issues 3 (1 open, 2 closed)

  • Related to openQA Infrastructure - action #150920: openqaworker-arm22 is unable to join download.opensuse.org in parallel tests = tap mode size:M (Resolved, nicksinger, 2023-11-15)
  • Related to openQA Project - action #155278: o3 aarch64 multi-machine tests on openqaworker-arm21 and 22 fail to resolve codecs.opensuse.org size:M (Resolved, dheidler, 2024-02-09)
  • Copied to openQA Project - action #159414: Ensure that os-autoinst-setup-multi-machine reliably sets firewall zones not interfering with /etc/sysconfig/network/ifcfg-* (New, 2024-03-18)
Actions #1

Updated by okurz about 1 month ago

  • Tags set to reactive work
  • Target version set to Ready
Actions #2

Updated by okurz about 1 month ago

  • Related to action #150920: openqaworker-arm22 is unable to join download.opensuse.org in parallel tests = tap mode size:M added
Actions #3

Updated by jbaier_cz about 1 month ago

https://openqa.opensuse.org/tests/4018286#step/remote_controller/14 shows

[   63.657904] named[3159]: timed out resolving './DNSKEY/IN': 10.150.1.11#53
[   63.659073] named[3159]: network unreachable resolving './DNSKEY/IN': 2001:7fd::1#53
[   63.660290] named[3159]: network unreachable resolving './DNSKEY/IN': 2001:500:12::d0d#53
[   63.661531] named[3159]: network unreachable resolving './DNSKEY/IN': 2001:500:9f::42#53
Actions #4

Updated by okurz about 1 month ago

  • Assignee set to jbaier_cz
Actions #5

Updated by jbaier_cz about 1 month ago

  • Related to action #155278: o3 aarch64 multi-machine tests on openqaworker-arm21 and 22 fail to resolve codecs.opensuse.org size:M added
Actions #6

Updated by jbaier_cz about 1 month ago

  • Status changed from New to In Progress
Actions #7

Updated by jbaier_cz about 1 month ago

  • Status changed from In Progress to New
  • Priority changed from Urgent to Normal

This is #150920 yet again.

It looks like #155278#note-8 did not make a difference.

openqaworker-arm21:~ #   cat /etc/sysconfig/network/ifcfg-eth0
BOOTPROTO='dhcp'
STARTMODE='auto'
ZONE=trusted
openqaworker-arm21:~ # firewall-cmd --list-all-zones
...
public (active)
  target: default
  icmp-block-inversion: no
  interfaces: eth0

Maybe something else is messing with the firewall configuration.

I reapplied #150920#note-25

firewall-cmd --zone public --remove-interface=eth0
firewall-cmd --zone trusted --add-interface=eth0

That should lower the severity. I will try to investigate what went wrong and look for a more permanent solution.
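The two commands above only change the runtime configuration, which is lost on reboot; applying the same change with --permanent (as done later in this ticket) makes it persistent. A sketch that emits both variants for review instead of executing them directly; the helper name is an assumption, not an existing script:

```shell
# Sketch: emit the firewall-cmd invocations needed to move an interface
# between zones, for both the runtime and the permanent configuration.
# Emitting instead of executing lets the sequence be reviewed first.
move_iface_cmds() {
    iface="$1"; from="$2"; to="$3"
    for flag in "" "--permanent "; do
        echo "firewall-cmd ${flag}--zone=$from --remove-interface=$iface"
        echo "firewall-cmd ${flag}--zone=$to --add-interface=$iface"
    done
}

# Review, then apply:
#   move_iface_cmds eth0 public trusted | sh
```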

Actions #8

Updated by jbaier_cz about 1 month ago

  • Tags changed from reactive work to reactive work, multi-machine
  • Subject changed from Network broken with multimachine on openqaworker-arm22 to Network broken with multimachine on multiple workers (broken packet forwarding / NAT)
  • Assignee deleted (jbaier_cz)
  • Priority changed from Normal to High

So the issue is not limited to arm workers: https://openqa.opensuse.org/tests/4029113

We have the same symptoms and:

openqaworker23:~ # firewall-cmd --list-all-zones
...
public (active)
  target: default
  icmp-block-inversion: no
  interfaces: eth0

openqaworker23:~ #  cat /etc/sysconfig/network/ifcfg-eth0
BOOTPROTO='none'
STARTMODE='auto'
ZONE=public

I changed that to the trusted zone, but it did not fix the issue there. The problem is clearly visible in the setup_multimachine module: https://openqa.opensuse.org/tests/4029215#step/setup_multimachine/110 where the network manager already shows limited connectivity (maybe we should at least issue a warning in the test at this step).

On the other hand https://openqa.opensuse.org/tests/4029081 failed in the very same way on openqaworker25 where the interface is already in the correct zone. The last successful run https://openqa.opensuse.org/tests/4026044 was on openqaworker26. I tried to compare those workers but I am unable to find any difference in the network setup.

So we might have an additional unknown problem here. Raising priority again, and as I will be on vacation, I am leaving this ticket free for someone to take over.

Actions #9

Updated by okurz about 1 month ago

  • Subject changed from Network broken with multimachine on multiple workers (broken packet forwarding / NAT) to Network broken with multimachine on multiple workers (broken packet forwarding / NAT) size:M
  • Description updated (diff)
  • Status changed from New to Workable
Actions #10

Updated by mkittler about 1 month ago · Edited

We could lower the urgency by working around the problem via https://github.com/os-autoinst/openQA/pull/5536.

Looks like the latest run worked again and it ran across worker21 and 22. Runs from before across worker22 and 24 were passing as well. Do we know what set of workers is affected?

EDIT: I tried to figure out good vs. problematic workers via the job history:

good: 21, 22, 24
bad: 23, 26
unknown: 25

Although I'm not sure how much sense that makes considering 7 days ago even jobs across 23 and 26 passed (https://openqa.opensuse.org/tests/4025897). So this might not even be specific to certain hosts.

Actions #11

Updated by mkittler about 1 month ago

  • Status changed from Workable to In Progress
  • Assignee set to mkittler
Actions #12

Updated by mkittler about 1 month ago · Edited

  • /etc/wicked/scripts/gre_tunnel_preup.sh looks good on all workers. (w20 seems non-existent; w27, w28, arm21 and arm22 are not connected to the gre network at all; this leaves w21 to w26 as valid workers in accordance with how the tap worker class is assigned)
  • enough tap devices are configured on all hosts (30 slots, so the highest number would be tap158) and they are in the trusted zone according to firewall-cmd --get-active-zones - except on w22 where the public zone is also active and contains the tap interfaces
  • cat /proc/sys/net/ipv4/ip_forward also returns 1 on all hosts
  • for i in $hosts; do echo $i && ssh root@$i "firewall-cmd --query-forward --zone=trusted" ; done returns "no" on all hosts; I would have expected a "yes" but it looks like we just don't configure forwarding via firewalld here.

I changed the zoning on w22 via for i in {0..178}; do firewall-cmd --remove-interface=tap$i --zone=public ; done and for i in {0..178}; do firewall-cmd --add-interface=tap$i --zone=trusted ; done. Now firewall-cmd --get-active-zones shows only the trusted zone (with all tap devices). Unfortunately I was distracted, so there was a delay of 2 hours between those two commands. It doesn't look like this caused much disruption, though. I also ran the same commands with --permanent but got warnings like Warning: ALREADY_ENABLED: tap94, so this was likely not necessary.
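The remove/add pair leaves each tap device briefly in no zone; firewall-cmd's --change-interface moves an interface into a zone in a single step. A sketch that emits one such command per tap device (the 0..178 range matches the loops used in this ticket; the helper name is made up):

```shell
# Sketch: emit one firewall-cmd --change-interface call per tap device.
# --change-interface assigns the interface to the target zone in one step,
# avoiding the brief "no zone" window of a separate remove/add pair.
tap_rezone_cmds() {
    zone="${1:-trusted}"
    i=0
    while [ "$i" -le 178 ]; do
        echo "firewall-cmd --permanent --zone=$zone --change-interface=tap$i"
        i=$((i + 1))
    done
}

# Review, then apply (repeat without --permanent for the runtime config):
#   tap_rezone_cmds trusted | sh
```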

Actions #13

Updated by openqa_review about 1 month ago

  • Due date set to 2024-04-11

Setting due date based on mean cycle time of SUSE QE Tools

Actions #14

Updated by mkittler about 1 month ago

The Next & Previous tab in the initial (arm) scenario looks very good. Recent jobs in the (x86_64) wireguard scenario (https://openqa.opensuse.org/tests/latest?arch=x86_64&distri=opensuse&flavor=DVD&machine=64bit&test=wireguard_client&version=Tumbleweed) are also passing. I'll keep observing the situation but would not work around the problem right now (by forcing jobs to run on a single worker) as it doesn't seem very problematic at the moment.

Actions #15

Updated by okurz about 1 month ago

I suggest rebooting multiple machines to see whether that changes the settings.

Actions #16

Updated by mkittler 27 days ago

On w22 the zone setting still reverted after I rebooted the machine. So I ran for i in {0..178}; do firewall-cmd --permanent --remove-interface=tap$i --zone=public ; firewall-cmd --permanent --add-interface=tap$i --zone=trusted ; done and rebooted the machine again. Now the zone setting is in fact persistent. (That means the "permanent" setting was previously wrong on w22.)

I also rebooted one of the other hosts to see whether that changes anything, but that does not seem to be the case. So I assume only w22 was wrongly configured.

Note that w22 being problematic could even explain failures seen in clusters that only ran across other hosts due to our way of connecting each host with each other host.

Actions #17

Updated by mkittler 27 days ago

  • Status changed from In Progress to Resolved

Both scenarios still look good. So I'm resolving this ticket now because I wouldn't know what else to improve. Probably the wrong setup on w22 was the culprit (besides the initial problem on arm workers which has already been fixed anyway).

Actions #18

Updated by okurz 21 days ago

  • Status changed from Resolved to Feedback
Actions #19

Updated by okurz 21 days ago

  • Due date deleted (2024-04-11)
Actions #20

Updated by okurz 21 days ago

  • Status changed from Feedback to Workable
  • Assignee deleted (mkittler)
  • Priority changed from High to Urgent
Actions #21

Updated by jbaier_cz 21 days ago · Edited

  • Priority changed from Urgent to High
# firewall-cmd --list-all-zones
...
public (active)
  target: default
  icmp-block-inversion: no
  interfaces: eth0
  sources:
  services: dhcpv6-client ssh
  ports:
  protocols:
  forward: no
  masquerade: no
  forward-ports:
  source-ports:
  icmp-blocks:
  rich rules:

I executed firewall-cmd --permanent --zone=public --remove-interface=eth0; firewall-cmd --permanent --zone=trusted --add-interface=eth0 which should at least lower the severity again as that should fix the tests.

Actions #23

Updated by okurz 21 days ago

  • Priority changed from High to Urgent
Actions #24

Updated by mkittler 21 days ago

  • Status changed from Workable to In Progress
  • Assignee set to mkittler
Actions #25

Updated by mkittler 21 days ago · Edited

I now also updated the runtime and permanent configuration (as it looks like only the permanent configuration was updated by @jbaier_cz). The persistent configuration looked good except that on arm21 the eth0 interface was in two zones at once (leading to firewalld complaining: Error: INVALID_ZONE: public trusted (ERROR: interface 'eth0' is in 2 zone XML files, can be only in one)), which I fixed.

The config only appeared broken on the arm workers and on openqaworker23. Not sure why eth0 was moved again to the public zone there.

I restarted the failing job, let's see whether it worked: https://openqa.opensuse.org/tests/4071829#dependencies


EDIT: The tests are now passing again. So it was really just the firewalld settings again. Last time I only considered x86_64 hosts (as @jbaier_cz had already taken care of arm hosts before) and also checked only one other host's persistent config by rebooting. So probably I just missed updating the persistent config of some hosts.

So I've now just run for i in $hosts; do echo $i && ssh root@$i "firewall-cmd --permanent --zone=trusted --change-interface=eth0" ; done to make sure the persistent config on all hosts is correct (and for i in $hosts; do echo $i && ssh root@$i "firewall-cmd --zone=trusted --change-interface=eth0" ; done to see whether it actually worked).
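Since the pattern here was runtime and permanent configuration drifting apart, a direct comparison of the two would catch the problem before it reverts on reload. A minimal sketch; the helper name and output format are assumptions:

```shell
# Sketch: given the runtime zone and the permanent zone of the same
# interface, flag any mismatch (a mismatch means the runtime fix will
# revert on the next firewalld reload or reboot).
zones_match() {
    runtime="$1"; permanent="$2"
    if [ "$runtime" = "$permanent" ]; then
        return 0
    fi
    echo "zone drift: runtime=$runtime permanent=$permanent" >&2
    return 1
}

# On a worker host:
#   zones_match "$(firewall-cmd --get-zone-of-interface=eth0)" \
#       "$(firewall-cmd --permanent --get-zone-of-interface=eth0)"
```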

Actions #26

Updated by mkittler 21 days ago

  • Status changed from In Progress to Feedback
  • Priority changed from Urgent to High

With that I hope we can resolve this ticket again.

Actions #27

Updated by livdywan 20 days ago · Edited

  • Subject changed from Network broken with multimachine on multiple workers (broken packet forwarding / NAT) size:M to Network broken with multimachine on multiple workers (broken packet forwarding / NAT) auto_review:"Test died.*curl -L openqa.opensuse.org.* failed at.*testapi":retry size:M

I restarted the failing job, let's see whether it worked: https://openqa.opensuse.org/tests/4071829#dependencies

Let's try and take advantage of autoreview so we can see if/when the settings are changed again

Actions #28

Updated by mkittler 20 days ago

  • Subject changed from Network broken with multimachine on multiple workers (broken packet forwarding / NAT) auto_review:"Test died.*curl -L openqa.opensuse.org.* failed at.*testapi":retry size:M to Network broken with multimachine on multiple workers (broken packet forwarding / NAT) size:M

But not like this; this regex is probably way too generic considering we probably use curl and the o3 domain a lot.

Actions #29

Updated by mkittler 19 days ago

  • Status changed from Feedback to Resolved

The arm and x86_64 scenarios are still looking good. As explained before we might have overlooked something but now the persistent and runtime config should be fine on all o3 tap x86_64 and arm worker hosts. So I'm considering this ticket resolved. If the interfaces appear again in the public zone at some point we will have to investigate what's changing the settings back but I wouldn't do that right now.

Actions #30

Updated by okurz 15 days ago

  • Due date set to 2024-04-25
  • Status changed from Resolved to Feedback

Please explicitly monitor jobs that ran over the weekend as well as any upcoming Tumbleweed aarch64 jobs, as those showed problems in the past.

Actions #31

Updated by mkittler 14 days ago

  • Status changed from Feedback to Resolved

The wireguard scenario has had no failures since then on either arch. I also checked other MM scenarios of the latest TW build and none failed (at least not like in this issue).

Actions #32

Updated by ggardet_arm 7 days ago

  • Status changed from Resolved to Workable
  • Priority changed from High to Urgent
Actions #33

Updated by okurz 7 days ago

  • Due date deleted (2024-04-25)
  • Assignee deleted (mkittler)
Actions #34

Updated by okurz 7 days ago

for i in $hosts; do echo $i && ssh root@$i "firewall-cmd --get-zone-of-interface=eth0"; done shows

openqaworker21
trusted
openqaworker22
trusted
openqaworker23
public
openqaworker24
trusted
openqaworker25
trusted
openqaworker26
trusted
openqaworker-arm21
public
openqaworker-arm22
public
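The per-host listing above can be reduced to a report of only the hosts that need fixing. A sketch built around the same loop; the helper name is made up for illustration:

```shell
# Sketch: read alternating "host" / "zone" lines (as printed by the loop
# above) and report only hosts whose eth0 is not in the trusted zone.
report_bad_zones() {
    while read -r host && read -r zone; do
        [ "$zone" = "trusted" ] || echo "$host: $zone"
    done
}

# Usage, combined with the loop from this comment:
#   for i in $hosts; do echo $i && ssh root@$i "firewall-cmd --get-zone-of-interface=eth0"; done \
#       | report_bad_zones
```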
Actions #35

Updated by okurz 7 days ago

  • Copied to action #159414: Ensure that os-autoinst-setup-multi-machine reliably sets firewall zones not interfering with /etc/sysconfig/network/ifcfg-* added
Actions #36

Updated by livdywan 7 days ago · Edited

for i in $hosts; do echo $i && sshpass -p $password ssh -o PubkeyAuthentication=no -o PreferredAuthentications=password root@$i "firewall-cmd --permanent --zone=trusted --change-interface=eth0"; done
for i in $hosts; do echo $i && sshpass -p $password ssh -o PubkeyAuthentication=no -o PreferredAuthentications=password root@$i "firewall-cmd --get-zone-of-interface=eth0"; done

So that should fix the zones for now; see https://progress.opensuse.org/projects/openqav3/wiki/Wiki#Manual-command-execution-on-o3-workers for the $hosts list

Actions #37

Updated by livdywan 7 days ago

  • Priority changed from Urgent to High

So I assume #159414 is the proper fix, but I am not sure this ticket can be resolved as long as something keeps changing the config under our feet?

Actions #38

Updated by livdywan 7 days ago

Btw, I also checked cat /etc/sysconfig/network/ifcfg-eth0 and it includes trusted on all hosts.

Actions #39

Updated by ggardet_arm 7 days ago

  • Priority changed from High to Urgent

livdywan wrote in #note-36:

for i in $hosts; do echo $i && sshpass -p $password ssh root@$i "firewall-cmd --permanent --zone=trusted --change-interface=eth0"; done
for i in $hosts; do echo $i && sshpass -p $password ssh root@$i "firewall-cmd --get-zone-of-interface=eth0"; done

So that should resolve the zone for now, see https://progress.opensuse.org/projects/openqav3/wiki/Wiki#Manual-command-execution-on-o3-workers for the $hosts

Still failing, see:
https://openqa.opensuse.org/tests/4098745#step/ovs_server/19

Actions #40

Updated by livdywan 7 days ago

  • Status changed from Workable to In Progress
  • Assignee set to livdywan

ggardet_arm wrote in #note-39:

Still failing, see:
https://openqa.opensuse.org/tests/4098745#step/ovs_server/19

Okay. Maybe this is a different issue then. I'm taking another look.

Actions #41

Updated by ggardet_arm 7 days ago

livdywan wrote in #note-40:

ggardet_arm wrote in #note-39:

Still failing, see:
https://openqa.opensuse.org/tests/4098745#step/ovs_server/19

Okay. Maybe this is a different issue then. I'm taking another look.

Looks like openqaworker-arm21 vs openqaworker-arm22

Actions #42

Updated by livdywan 7 days ago · Edited

ggardet_arm wrote in #note-41:

livdywan wrote in #note-40:

ggardet_arm wrote in #note-39:

Still failing, see:
https://openqa.opensuse.org/tests/4098745#step/ovs_server/19

Okay. Maybe this is a different issue then. I'm taking another look.

Looks like openqaworker-arm21 vs openqaworker-arm22

Good catch! Apparently something once again reverted the zone on arm22 to public? Even though I'd just checked and fixed it... well, I fixed it once more. Let's see if this resolves the error. I'm keeping an eye on the worker as well this time - https://openqa.opensuse.org/tests/4099199#live is on arm22

Actions #43

Updated by livdywan 7 days ago

  • Description updated (diff)
Actions #44

Updated by livdywan 7 days ago

  • Status changed from In Progress to Resolved

Jobs on multiple workers are passing now. I once more confirmed that all $hosts report that they are in the trusted zone. Let's see how long this lasts 🤞

Actions #45

Updated by okurz 6 days ago

  • Status changed from Resolved to Feedback
  • Priority changed from Urgent to High

No, we should monitor more actively ourselves, as this ticket has been reopened often enough already without us looking into the real issues.

Actions #46

Updated by livdywan 6 days ago

  • Status changed from Feedback to Workable
  • Assignee deleted (livdywan)

My understanding was that this is covered in other tickets and we are not aware of other issues. I guess I'll put it back in the queue for now.

Actions #48

Updated by mkittler 5 days ago

  • Status changed from Workable to Feedback
  • Assignee set to mkittler
Actions #49

Updated by mkittler 4 days ago

The scenarios still look good. I'll close this ticket on Monday if it still looks good then.

Actions #50

Updated by okurz 4 days ago

Please monitor for longer, as we have overlooked multiple problems and shouldn't need to rely only on ggardet to inform us about related problems.
