action #155278
closed
o3 aarch64 multi-machine tests on openqaworker-arm21 and 22 fail to resolve codecs.opensuse.org size:M
0%
Description
Observation
As reported by guillaume_g in https://matrix.to/#/!dRljORKAiNJcGEDbYA:opensuse.org/$-usg44GmRDhSKx4vOveeAHRRzQ9znlNJBjGedBCXIgs
it seems we have a regression on multimachine tests on aarch64: https://openqa.opensuse.org/tests/3927595#step/ovs_client/32
Acceptance criteria
- AC1: Firewall configuration on openqaworker-arm21 and 22 is persistent (so MM setup works)
Suggestions
- Probably #150920 again
- Reapply the steps from #150920#note-25 (the firewall zone change); that solved the issue before
- Restart the affected jobs
- Create a corresponding test module improvement request in https://progress.opensuse.org/projects/openqatests/issues/new or improve the ping test module yourself so it actually fails in case of a problem
Open questions
- Why was the interface in the public zone again?
- Why are we not failing inside the setup multimachine test module when nmcli network connectivity check clearly shows limited (it is supposed to be full) and the dig afterwards is failing?
Updated by jbaier_cz 8 months ago · Edited
I guess the main problem is already visible in https://openqa.opensuse.org/tests/3927886#step/setup_multimachine/67; unfortunately, we do not have any assertion in the setup.
dig +short server.openqa.test
;; communication error to 10.150.1.11#53: timed out
The same issue can be seen in other multimachine jobs as well, for example https://openqa.opensuse.org/tests/3927606#step/yast2_nfs4_server/78
Access from the worker itself looks fine. It also seems to affect only openqaworker-arm21 and openqaworker-arm22.
It is also not happening in Build20240205. I do not see many changes in our codebase or in the worker zypper log.
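The missing assertion mentioned above could look roughly like the following. This is only a sketch: `dig_answer_ok` is a hypothetical helper name, not something from the test distribution, and it assumes that a usable `dig +short` result is non-empty and contains no `;;` diagnostic lines such as the timeout shown above.

```shell
#!/bin/sh
# Hypothetical helper (not part of the openQA test code): decide whether
# captured `dig +short` output looks like a real answer.
# $1: the captured output of dig
dig_answer_ok() {
    output=$1
    # Empty output means no record was returned at all.
    [ -n "$output" ] || return 1
    # Lines starting with ";;" are dig diagnostics (for example timeouts),
    # not answers.
    if printf '%s\n' "$output" | grep -q '^;;'; then
        return 1
    fi
    return 0
}

# Example use in a setup step (needs dig and a reachable resolver):
#   dig_answer_ok "$(dig +short server.openqa.test)" || exit 1
```

Checking the output rather than only the exit status keeps the check independent of how the calling test framework propagates return codes.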
Updated by jbaier_cz 8 months ago
- Related to action #150920: openqaworker-arm22 is unable to join download.opensuse.org in parallel tests = tap mode size:M added
Updated by jbaier_cz 8 months ago
- Priority changed from Urgent to Normal
This issue is basically #150920 all over again. I reapplied steps from https://progress.opensuse.org/issues/150920#note-25 (the firewall zone change) and that solved the issue. I also restarted the affected jobs, the linked one already passed; hence lowering the prio.
Open questions:
- Why was the interface in the public zone again?
- Why are we not failing inside the setup multimachine test module when nmcli network connectivity check clearly shows limited (it is supposed to be full) and the dig afterwards is failing?
Updated by jbaier_cz 8 months ago
- Related to action #154624: Periodically running simple ping-check multi-machine tests on x86_64 covering multiple physical hosts on OSD alerting tools team on failures size:M added
Updated by mkittler 8 months ago
- Subject changed from o3 aarch64 multi-machine tests on openqaworker-arm22 fail to resolve codecs.opensuse.org to o3 aarch64 multi-machine tests on openqaworker-arm21 and 22 fail to resolve codecs.opensuse.org size:M
- Description updated (diff)
- Status changed from New to Workable
Updated by dheidler 8 months ago
The reason why eth0 always wants to go back to the public zone:
openqaworker-arm21:~ # firewall-cmd --list-all-zones
…
trusted (active)
target: ACCEPT
icmp-block-inversion: no
interfaces: br1 eth0 ovs-system tap0 tap1 …
…
openqaworker-arm21:~ # cat /etc/sysconfig/network/ifcfg-eth0
BOOTPROTO='dhcp'
STARTMODE='auto'
ZONE=public
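To spot this kind of mismatch on other workers, the zone stored in the ifcfg file can be compared with the zone the interface is expected to stay in. The sketch below is an assumption-laden illustration: `ifcfg_zone` and `zone_matches` are hypothetical helper names, and `trusted` as the expected zone is taken from the listing above.

```shell
#!/bin/sh
# Hypothetical helper: print the ZONE value from an ifcfg file,
# or nothing if the file does not set one.
# $1: path to the ifcfg file
ifcfg_zone() {
    # Take the last ZONE= assignment and drop optional quotes around the value.
    sed -n 's/^ZONE=//p' "$1" | tail -n 1 | tr -d "\"'"
}

# Hypothetical helper: succeed if the configured zone equals the expected one.
# $1: path to the ifcfg file, $2: expected zone name
zone_matches() {
    [ "$(ifcfg_zone "$1")" = "$2" ]
}

# Example use on a worker:
#   zone_matches /etc/sysconfig/network/ifcfg-eth0 trusted \
#       || echo "ifcfg-eth0 still requests zone '$(ifcfg_zone /etc/sysconfig/network/ifcfg-eth0)'"
```

The runtime side of the comparison is available from `firewall-cmd --get-zone-of-interface=eth0`; the helper above only inspects the persistent configuration.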
Updated by dheidler 8 months ago
- Status changed from In Progress to Feedback
nmcli network connectivity check always returns 0, even when printing limited.
https://github.com/os-autoinst/os-autoinst-distri-opensuse/pull/18726 should fix that.
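Because the exit status carries no information here, a check has to look at the printed state instead. The following is a minimal sketch of that idea and not necessarily what the linked pull request does; `connectivity_is_full` is a hypothetical helper name.

```shell
#!/bin/sh
# Hypothetical helper: succeed only when the reported connectivity
# state is exactly "full".
# $1: the state string printed by nmcli
connectivity_is_full() {
    [ "$1" = "full" ]
}

# Example use in a setup step (needs NetworkManager's nmcli):
#   state=$(nmcli network connectivity check)
#   connectivity_is_full "$state" || {
#       echo "connectivity is '$state', expected 'full'" >&2
#       exit 1
#   }
```

Comparing against the exact string also rejects empty output, so a failed or missing nmcli call is not mistaken for full connectivity.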
Updated by jbaier_cz 7 months ago
- Related to action #157414: Network broken with multimachine on multiple workers (broken packet forwarding / NAT) size:M added