action #157414

Network broken with multimachine on multiple workers (broken packet forwarding / NAT) size:M

Added by ggardet_arm about 1 month ago. Updated 4 days ago.

Status: Feedback
Priority: High
Category: Bugs in existing tests
Start date: 2024-03-18
% Done: 0%

Description

Observation

openQA test in scenario microos-Tumbleweed-DVD-aarch64-remote_ssh_controller@aarch64 fails in
await_install
possibly similar to what happened in #150920 and #155278.

Test suite description

Maintainer: jrivera. Install remote server (parallel job) with ssh.

Reproducible

Fails since (at least) Build 20240314 (current job)

Expected result

Last good: 20240310 (or more recent)

Suggestions

  • Extend the existing test code as suggested in #157414-8 to produce more explicit error messages
  • Look up the ticket history in #150920 and #155278
  • Consider extending our setup multimachine script and potentially calling it periodically
  • Consider more explicit error checks in our worker code to avoid even running into such problems in openQA tests
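The last suggestion could look like a small periodic check on each worker. A minimal sketch, assuming the expected zone is "trusted" as discussed in this ticket; the helper name is made up for illustration:

```shell
#!/bin/sh
# Sketch: verify that an interface is in the expected firewalld zone.
# Expecting "trusted" is an assumption based on this ticket.

# Pure helper: compare an observed zone against the expected one.
zone_ok() {
    observed="$1"
    expected="${2:-trusted}"
    [ "$observed" = "$expected" ]
}

# On a worker this could run periodically, e.g. from a systemd timer:
#   zone=$(firewall-cmd --get-zone-of-interface=eth0)
#   zone_ok "$zone" || echo "WARNING: eth0 is in zone '$zone', expected 'trusted'" >&2
```

Such a check would have turned the silent misconfiguration into a visible warning long before a multi-machine job failed.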

Further details

Always latest result in this scenario: latest

Out of scope

  • Find a persistent solution, see #159414

Related issues 3 (1 open, 2 closed)

  • Related to openQA Infrastructure - action #150920: openqaworker-arm22 is unable to join download.opensuse.org in parallel tests = tap mode size:M (Resolved, nicksinger, 2023-11-15)
  • Related to openQA Project - action #155278: o3 aarch64 multi-machine tests on openqaworker-arm21 and 22 fail to resolve codecs.opensuse.org size:M (Resolved, dheidler, 2024-02-09)
  • Copied to openQA Project - action #159414: Ensure that os-autoinst-setup-multi-machine reliably sets firewall zones not interfering with /etc/sysconfig/network/ifcfg-* (New, 2024-03-18)
Actions #1

Updated by okurz about 1 month ago

  • Tags set to reactive work
  • Target version set to Ready
Actions #2

Updated by okurz about 1 month ago

  • Related to action #150920: openqaworker-arm22 is unable to join download.opensuse.org in parallel tests = tap mode size:M added
Actions #3

Updated by jbaier_cz about 1 month ago

https://openqa.opensuse.org/tests/4018286#step/remote_controller/14 shows

[   63.657904] named[3159]: timed out resolving './DNSKEY/IN': 10.150.1.11#53
[   63.659073] named[3159]: network unreachable resolving './DNSKEY/IN': 2001:7fd::1#53
[   63.660290] named[3159]: network unreachable resolving './DNSKEY/IN': 2001:500:12::d0d#53
[   63.661531] named[3159]: network unreachable resolving './DNSKEY/IN': 2001:500:9f::42#53
Actions #4

Updated by okurz about 1 month ago

  • Assignee set to jbaier_cz
Actions #5

Updated by jbaier_cz about 1 month ago

  • Related to action #155278: o3 aarch64 multi-machine tests on openqaworker-arm21 and 22 fail to resolve codecs.opensuse.org size:M added
Actions #6

Updated by jbaier_cz about 1 month ago

  • Status changed from New to In Progress
Actions #7

Updated by jbaier_cz about 1 month ago

  • Status changed from In Progress to New
  • Priority changed from Urgent to Normal

This is #150920 yet again.

It looks like #155278#note-8 did not make a difference.

openqaworker-arm21:~ #   cat /etc/sysconfig/network/ifcfg-eth0
BOOTPROTO='dhcp'
STARTMODE='auto'
ZONE=trusted
openqaworker-arm21:~ # firewall-cmd --list-all-zones
...
public (active)
  target: default
  icmp-block-inversion: no
  interfaces: eth0

Maybe something else is messing with the firewall configuration.

I reapplied #150920#note-25

firewall-cmd --zone public --remove-interface=eth0
firewall-cmd --zone trusted --add-interface=eth0

That should lower the severity. I will try to investigate what went wrong and look for a more permanent solution.
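The two commands above only change the runtime configuration, which is lost on reboot; applying the same change with --permanent (as done later in this ticket) makes it persistent. A sketch that emits both variants for review instead of executing them directly; the helper name is an assumption, not an existing script:

```shell
# Sketch: emit the firewall-cmd invocations needed to move an interface
# between zones, for both the runtime and the permanent configuration.
# Emitting instead of executing lets the sequence be reviewed first.
move_iface_cmds() {
    iface="$1"; from="$2"; to="$3"
    for flag in "" "--permanent "; do
        echo "firewall-cmd ${flag}--zone=$from --remove-interface=$iface"
        echo "firewall-cmd ${flag}--zone=$to --add-interface=$iface"
    done
}

# Review, then apply:
#   move_iface_cmds eth0 public trusted | sh
```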

Actions #8

Updated by jbaier_cz about 1 month ago

  • Tags changed from reactive work to reactive work, multi-machine
  • Subject changed from Network broken with multimachine on openqaworker-arm22 to Network broken with multimachine on multiple workers (broken packet forwarding / NAT)
  • Assignee deleted (jbaier_cz)
  • Priority changed from Normal to High

So the issue is not limited to arm workers: https://openqa.opensuse.org/tests/4029113

We have the same symptoms and:

openqaworker23:~ # firewall-cmd --list-all-zones
...
public (active)
  target: default
  icmp-block-inversion: no
  interfaces: eth0

openqaworker23:~ #  cat /etc/sysconfig/network/ifcfg-eth0
BOOTPROTO='none'
STARTMODE='auto'
ZONE=public

I changed that to the trusted zone, but it did not fix the issue there. The problem is clearly visible in the setup_multimachine module: https://openqa.opensuse.org/tests/4029215#step/setup_multimachine/110 where the network manager already shows limited connectivity (maybe we should at least issue a warning in the test at this step).

On the other hand https://openqa.opensuse.org/tests/4029081 failed in the very same way on openqaworker25 where the interface is already in the correct zone. The last successful run https://openqa.opensuse.org/tests/4026044 was on openqaworker26. I tried to compare those workers but I am unable to find any difference in the network setup.

So we might have an additional unknown problem here. Raising priority again, and as I will be on vacation, I am leaving this ticket free for someone to take over.

Actions #9

Updated by okurz about 1 month ago

  • Subject changed from Network broken with multimachine on multiple workers (broken packet forwarding / NAT) to Network broken with multimachine on multiple workers (broken packet forwarding / NAT) size:M
  • Description updated (diff)
  • Status changed from New to Workable
Actions #10

Updated by mkittler about 1 month ago · Edited

We could lower the urgency by working around the problem via https://github.com/os-autoinst/openQA/pull/5536.

Looks like the latest run worked again and it ran across worker21 and 22. Runs from before across worker22 and 24 were passing as well. Do we know what set of workers is affected?

EDIT: I tried to figure out good vs. problematic workers via the job history:

good: 21, 22, 24
bad: 23, 26
unknown: 25

Although I'm not sure how much sense that makes considering 7 days ago even jobs across 23 and 26 passed (https://openqa.opensuse.org/tests/4025897). So this might not even be specific to certain hosts.

Actions #11

Updated by mkittler about 1 month ago

  • Status changed from Workable to In Progress
  • Assignee set to mkittler
Actions #12

Updated by mkittler about 1 month ago · Edited

  • /etc/wicked/scripts/gre_tunnel_preup.sh looks good on all workers. (w20 seems non-existent; w27, w28, arm21 and arm22 are not connected to the gre network at all; this leaves w21 to w26 as valid workers in accordance with how the tap worker class is assigned)
  • enough tap devices are configured on all hosts (30 slots, so the highest number would be tap158) and they are in the trusted zone according to firewall-cmd --get-active-zones - except on w22 where the public zone is also active and contains the tap interfaces
  • cat /proc/sys/net/ipv4/ip_forward also returns 1 on all hosts
  • for i in $hosts; do echo $i && ssh root@$i "firewall-cmd --query-forward --zone=trusted" ; done returns "no" on all hosts; I would have expected a "yes" but it looks like we just don't configure forwarding via firewalld here.

I changed the zoning on w22 via for i in {0..178}; do firewall-cmd --remove-interface=tap$i --zone=public ; done and for i in {0..178}; do firewall-cmd --add-interface=tap$i --zone=trusted ; done. Now firewall-cmd --get-active-zones shows only the trusted zone (with all tap devices). Unfortunately I was distracted, so there was a delay of 2 hours between those two commands. It doesn't look like this caused much disruption, though. I also ran the same commands with --permanent but got warnings like Warning: ALREADY_ENABLED: tap94, so this was likely not necessary.
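The remove/add pair leaves each tap device briefly in no zone; firewall-cmd's --change-interface moves an interface into a zone in a single step. A sketch that emits one such command per tap device (the 0..178 range matches the loops used in this ticket; the helper name is made up):

```shell
# Sketch: emit one firewall-cmd --change-interface call per tap device.
# --change-interface assigns the interface to the target zone in one step,
# avoiding the brief "no zone" window of a separate remove/add pair.
tap_rezone_cmds() {
    zone="${1:-trusted}"
    i=0
    while [ "$i" -le 178 ]; do
        echo "firewall-cmd --permanent --zone=$zone --change-interface=tap$i"
        i=$((i + 1))
    done
}

# Review, then apply (repeat without --permanent for the runtime config):
#   tap_rezone_cmds trusted | sh
```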

Actions #13

Updated by openqa_review about 1 month ago

  • Due date set to 2024-04-11

Setting due date based on mean cycle time of SUSE QE Tools

Actions #14

Updated by mkittler about 1 month ago

The Next & Previous tab in the initial (arm) scenario looks very good. Recent jobs in the (x86_64) wireguard scenario (https://openqa.opensuse.org/tests/latest?arch=x86_64&distri=opensuse&flavor=DVD&machine=64bit&test=wireguard_client&version=Tumbleweed) are also passing. I'll keep observing the situation but would not work around the problem right now (by forcing jobs to run on a single worker) as it doesn't seem very problematic at the moment.

Actions #15

Updated by okurz about 1 month ago

I suggest rebooting multiple machines to see whether that changes the settings.

Actions #16

Updated by mkittler 27 days ago

On w22 the zone setting still reverted after I rebooted the machine. So I ran for i in {0..178}; do firewall-cmd --permanent --remove-interface=tap$i --zone=public ; firewall-cmd --permanent --add-interface=tap$i --zone=trusted ; done and rebooted the machine again. Now the zone setting is in fact persistent. (That means the "permanent" setting was previously wrong on w22.)

I also rebooted one of the other hosts to see whether that changes anything, but that does not seem to be the case. So I assume only w22 was wrongly configured.

Note that w22 being problematic could even explain failures seen in clusters that only ran across other hosts due to our way of connecting each host with each other host.

Actions #17

Updated by mkittler 27 days ago

  • Status changed from In Progress to Resolved

Both scenarios still look good. So I'm resolving this ticket now because I wouldn't know what else to improve. Probably the wrong setup on w22 was the culprit (besides the initial problem on arm workers which has already been fixed anyway).

Actions #18

Updated by okurz 21 days ago

  • Status changed from Resolved to Feedback
Actions #19

Updated by okurz 21 days ago

  • Due date deleted (2024-04-11)
Actions #20

Updated by okurz 21 days ago

  • Status changed from Feedback to Workable
  • Assignee deleted (mkittler)
  • Priority changed from High to Urgent
Actions #21

Updated by jbaier_cz 21 days ago · Edited

  • Priority changed from Urgent to High
# firewall-cmd --list-all-zones
...
public (active)
  target: default
  icmp-block-inversion: no
  interfaces: eth0
  sources:
  services: dhcpv6-client ssh
  ports:
  protocols:
  forward: no
  masquerade: no
  forward-ports:
  source-ports:
  icmp-blocks:
  rich rules:

I executed firewall-cmd --permanent --zone=public --remove-interface=eth0; firewall-cmd --permanent --zone=trusted --add-interface=eth0 which should at least lower the severity again as that should fix the tests.

Actions #23

Updated by okurz 21 days ago

  • Priority changed from High to Urgent
Actions #24

Updated by mkittler 21 days ago

  • Status changed from Workable to In Progress
  • Assignee set to mkittler
Actions #25

Updated by mkittler 21 days ago · Edited

I now also updated the runtime and permanent configuration (as it looks like only the permanent configuration was updated by @jbaier_cz). The persistent configuration looked good except that on arm21 the eth0 interface was in two zones at once (leading to firewalld complaining: Error: INVALID_ZONE: public trusted (ERROR: interface 'eth0' is in 2 zone XML files, can be only in one)), which I fixed.

The config only appeared broken on the arm workers and on openqaworker23. Not sure why eth0 was moved again to the public zone there.

I restarted the failing job, let's see whether it worked: https://openqa.opensuse.org/tests/4071829#dependencies


EDIT: The tests are now passing again. So it was really just the firewalld settings again. Last time I only considered x86_64 hosts (as @jbaier_cz had already taken care of arm hosts before) and also checked only one other host's persistent config by rebooting. So probably I just missed updating the persistent config of some hosts.

So I've now just run for i in $hosts; do echo $i && ssh root@$i "firewall-cmd --permanent --zone=trusted --change-interface=eth0" ; done to make sure the persistent config on all hosts is correct (and for i in $hosts; do echo $i && ssh root@$i "firewall-cmd --zone=trusted --change-interface=eth0" ; done to see whether it actually worked).
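Since the pattern here was runtime and permanent configuration drifting apart, a direct comparison of the two would catch the problem before it reverts on reload. A minimal sketch; the helper name and output format are assumptions:

```shell
# Sketch: given the runtime zone and the permanent zone of the same
# interface, flag any mismatch (a mismatch means the runtime fix will
# revert on the next firewalld reload or reboot).
zones_match() {
    runtime="$1"; permanent="$2"
    if [ "$runtime" = "$permanent" ]; then
        return 0
    fi
    echo "zone drift: runtime=$runtime permanent=$permanent" >&2
    return 1
}

# On a worker host:
#   zones_match "$(firewall-cmd --get-zone-of-interface=eth0)" \
#       "$(firewall-cmd --permanent --get-zone-of-interface=eth0)"
```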

Actions #26

Updated by mkittler 21 days ago

  • Status changed from In Progress to Feedback
  • Priority changed from Urgent to High

With that I hope we can resolve this ticket again.

Actions #27

Updated by livdywan 20 days ago · Edited

  • Subject changed from Network broken with multimachine on multiple workers (broken packet forwarding / NAT) size:M to Network broken with multimachine on multiple workers (broken packet forwarding / NAT) auto_review:"Test died.*curl -L openqa.opensuse.org.* failed at.*testapi":retry size:M

I restarted the failing job, let's see whether it worked: https://openqa.opensuse.org/tests/4071829#dependencies

Let's try and take advantage of autoreview so we can see if/when the settings are changed again

Actions #28

Updated by mkittler 20 days ago

  • Subject changed from Network broken with multimachine on multiple workers (broken packet forwarding / NAT) auto_review:"Test died.*curl -L openqa.opensuse.org.* failed at.*testapi":retry size:M to Network broken with multimachine on multiple workers (broken packet forwarding / NAT) size:M

But not like this; this regex is probably way too generic considering we probably use curl and the o3 domain a lot.

Actions #29

Updated by mkittler 19 days ago

  • Status changed from Feedback to Resolved

The arm and x86_64 scenarios are still looking good. As explained before we might have overlooked something but now the persistent and runtime config should be fine on all o3 tap x86_64 and arm worker hosts. So I'm considering this ticket resolved. If the interfaces appear again in the public zone at some point we will have to investigate what's changing the settings back but I wouldn't do that right now.

Actions #30

Updated by okurz 15 days ago

  • Due date set to 2024-04-25
  • Status changed from Resolved to Feedback

Please explicitly monitor jobs that ran over the weekend as well as any upcoming Tumbleweed aarch64 jobs, as those showed problems in the past.

Actions #31

Updated by mkittler 14 days ago

  • Status changed from Feedback to Resolved

The wireguard scenario has had no failures since then on either arch. I also checked other MM scenarios of the latest TW build and none failed (at least not like in this issue).

Actions #32

Updated by ggardet_arm 7 days ago

  • Status changed from Resolved to Workable
  • Priority changed from High to Urgent
Actions #33

Updated by okurz 7 days ago

  • Due date deleted (2024-04-25)
  • Assignee deleted (mkittler)
Actions #34

Updated by okurz 7 days ago

for i in $hosts; do echo $i && ssh root@$i "firewall-cmd --get-zone-of-interface=eth0"; done shows

openqaworker21
trusted
openqaworker22
trusted
openqaworker23
public
openqaworker24
trusted
openqaworker25
trusted
openqaworker26
trusted
openqaworker-arm21
public
openqaworker-arm22
public
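The per-host listing above can be reduced to a report of only the hosts that need fixing. A sketch built around the same loop; the helper name is made up for illustration:

```shell
# Sketch: read alternating "host" / "zone" lines (as printed by the loop
# above) and report only hosts whose eth0 is not in the trusted zone.
report_bad_zones() {
    while read -r host && read -r zone; do
        [ "$zone" = "trusted" ] || echo "$host: $zone"
    done
}

# Usage, combined with the loop from this comment:
#   for i in $hosts; do echo $i && ssh root@$i "firewall-cmd --get-zone-of-interface=eth0"; done \
#       | report_bad_zones
```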
Actions #35

Updated by okurz 7 days ago

  • Copied to action #159414: Ensure that os-autoinst-setup-multi-machine reliably sets firewall zones not interfering with /etc/sysconfig/network/ifcfg-* added
Actions #36

Updated by livdywan 7 days ago · Edited

for i in $hosts; do echo $i && sshpass -p $password ssh -o PubkeyAuthentication=no -o PreferredAuthentications=password root@$i "firewall-cmd --permanent --zone=trusted --change-interface=eth0"; done
for i in $hosts; do echo $i && sshpass -p $password ssh -o PubkeyAuthentication=no -o PreferredAuthentications=password root@$i "firewall-cmd --get-zone-of-interface=eth0"; done

So that should fix the zones for now; see https://progress.opensuse.org/projects/openqav3/wiki/Wiki#Manual-command-execution-on-o3-workers for the $hosts list

Actions #37

Updated by livdywan 7 days ago

  • Priority changed from Urgent to High

So I assume #159414 is the proper fix, but I am not sure this ticket can be resolved as long as something keeps changing the config under our feet?

Actions #38

Updated by livdywan 7 days ago

Btw, I also checked cat /etc/sysconfig/network/ifcfg-eth0 and it includes trusted on all hosts.

Actions #39

Updated by ggardet_arm 7 days ago

  • Priority changed from High to Urgent

livdywan wrote in #note-36:

for i in $hosts; do echo $i && sshpass -p $password ssh root@$i "firewall-cmd --permanent --zone=trusted --change-interface=eth0"; done
for i in $hosts; do echo $i && sshpass -p $password ssh root@$i "firewall-cmd --get-zone-of-interface=eth0"; done

So that should resolve the zone for now, see https://progress.opensuse.org/projects/openqav3/wiki/Wiki#Manual-command-execution-on-o3-workers for the $hosts

Still failing, see:
https://openqa.opensuse.org/tests/4098745#step/ovs_server/19

Actions #40

Updated by livdywan 7 days ago

  • Status changed from Workable to In Progress
  • Assignee set to livdywan

ggardet_arm wrote in #note-39:

Still failing, see:
https://openqa.opensuse.org/tests/4098745#step/ovs_server/19

Okay. Maybe this is a different issue then. I'm taking another look.

Actions #41

Updated by ggardet_arm 7 days ago

livdywan wrote in #note-40:

ggardet_arm wrote in #note-39:

Still failing, see:
https://openqa.opensuse.org/tests/4098745#step/ovs_server/19

Okay. Maybe this is a different issue then. I'm taking another look.

Looks like openqaworker-arm21 vs openqaworker-arm22

Actions #42

Updated by livdywan 7 days ago · Edited

ggardet_arm wrote in #note-41:

livdywan wrote in #note-40:

ggardet_arm wrote in #note-39:

Still failing, see:
https://openqa.opensuse.org/tests/4098745#step/ovs_server/19

Okay. Maybe this is a different issue then. I'm taking another look.

Looks like openqaworker-arm21 vs openqaworker-arm22

Good catch! Apparently something once again reverted the zone on arm22 to public? Even though I'd just checked and fixed it... well, I fixed it once more. Let's see if this resolves the error. I'm keeping an eye on the worker as well this time - https://openqa.opensuse.org/tests/4099199#live is on arm22

Actions #43

Updated by livdywan 7 days ago

  • Description updated (diff)
Actions #44

Updated by livdywan 7 days ago

  • Status changed from In Progress to Resolved

Jobs on multiple workers are passing now. I once more confirmed that all $hosts report that they are in the trusted zone. Let's see how long this lasts 🤞

Actions #45

Updated by okurz 6 days ago

  • Status changed from Resolved to Feedback
  • Priority changed from Urgent to High

No, we should monitor more actively ourselves, as this ticket has been reopened often enough already without us looking into the real issues.

Actions #46

Updated by livdywan 6 days ago

  • Status changed from Feedback to Workable
  • Assignee deleted (livdywan)

My understanding was that this is covered in other tickets and we are not aware of other issues. I guess I'll put it back in the queue for now.

Actions #48

Updated by mkittler 5 days ago

  • Status changed from Workable to Feedback
  • Assignee set to mkittler
Actions #49

Updated by mkittler 4 days ago

The scenarios still look good. I'll close this ticket on Monday if it still looks good then.

Actions #50

Updated by okurz 4 days ago

Please monitor for longer, as we have overlooked multiple problems and shouldn't need to rely only on ggardet to inform us about related problems.
