action #155278

closed

o3 aarch64 multi-machine tests on openqaworker-arm21 and 22 fail to resolve codecs.opensuse.org size:M

Added by okurz 3 months ago. Updated 2 months ago.

Status: Resolved
Priority: Normal
Assignee:
Category: Regressions/Crashes
Target version:
Start date: 2024-02-09
Due date:
% Done: 0%
Estimated time:
Description

Observation

As reported by guillaume_g in https://matrix.to/#/!dRljORKAiNJcGEDbYA:opensuse.org/$-usg44GmRDhSKx4vOveeAHRRzQ9znlNJBjGedBCXIgs

it seems we have a regression on multimachine tests on aarch64: https://openqa.opensuse.org/tests/3927595#step/ovs_client/32

Acceptance criteria

  • AC1: Firewall configuration on openqaworker-arm21 and 22 is persistent (so MM setup works)

Suggestions

Open questions

Why was the interface in the public zone again?

Why are we not failing inside the setup_multimachine test module when the nmcli network connectivity check clearly shows limited (it is supposed to be full) and the subsequent dig fails?

Related issues 3 (1 open, 2 closed)

  • Related to openQA Infrastructure - action #150920: openqaworker-arm22 is unable to join download.opensuse.org in parallel tests = tap mode size:M (Resolved, nicksinger, 2023-11-15)
  • Related to openQA Infrastructure - action #154624: Periodically running simple ping-check multi-machine tests on x86_64 covering multiple physical hosts on OSD alerting tools team on failures size:M (Resolved, jbaier_cz, 2024-01-30)
  • Related to openQA Tests - action #157414: Network broken with multimachine on multiple workers (broken packet forwarding / NAT) size:M (Feedback, mkittler, 2024-03-18)

Actions #1

Updated by jbaier_cz 3 months ago · Edited

I guess the main problem is already visible in https://openqa.opensuse.org/tests/3927886#step/setup_multimachine/67. Unfortunately, we do not have any assert in the setup.

dig +short server.openqa.test
;; communication error to 10.150.1.11#53: timed out

The same issue can be seen in other multimachine jobs as well, for example https://openqa.opensuse.org/tests/3927606#step/yast2_nfs4_server/78

Access from the worker itself looks fine, and the issue seems to affect only openqaworker-arm21 and openqaworker-arm22.

It is also not happening in Build20240205, and I do not see many changes in our codebase or in the worker zypper log.
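
The missing assert mentioned above could look roughly like the following. This is only a sketch: dig_has_answer is a hypothetical helper name, not something that exists in the test suite, and it relies on the observation that dig +short prints diagnostics such as the timeout above as lines starting with ";;".

```shell
#!/bin/sh
# Hypothetical helper (not part of the test suite): succeeds only when the
# captured `dig +short` output contains at least one real answer line.
# Lines starting with ";;" are dig diagnostics, such as the timeout above.
dig_has_answer() {
    printf '%s\n' "$1" | grep -v '^;;' | grep -q '[^[:space:]]'
}

# Example use (host name taken from the job above):
#   out=$(dig +short server.openqa.test)
#   dig_has_answer "$out" || { echo "DNS lookup failed: $out" >&2; exit 1; }
```

Checking the printed text instead of only the exit status also treats an empty answer as a failure.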

Actions #2

Updated by jbaier_cz 3 months ago

  • Related to action #150920: openqaworker-arm22 is unable to join download.opensuse.org in parallel tests = tap mode size:M added
Actions #3

Updated by jbaier_cz 3 months ago

  • Priority changed from Urgent to Normal

This issue is basically #150920 all over again. I reapplied the steps from https://progress.opensuse.org/issues/150920#note-25 (the firewall zone change) and that solved the issue. I also restarted the affected jobs; the linked one has already passed, hence lowering the prio.

Open questions:

  1. Why was the interface in the public zone again?
  2. Why are we not failing inside the setup_multimachine test module when the nmcli network connectivity check clearly shows limited (it is supposed to be full) and the subsequent dig fails?
Actions #4

Updated by jbaier_cz 3 months ago

  • Related to action #154624: Periodically running simple ping-check multi-machine tests on x86_64 covering multiple physical hosts on OSD alerting tools team on failures size:M added
Actions #5

Updated by mkittler 2 months ago

  • Subject changed from o3 aarch64 multi-machine tests on openqaworker-arm22 fail to resolve codecs.opensuse.org to o3 aarch64 multi-machine tests on openqaworker-arm21 and 22 fail to resolve codecs.opensuse.org size:M
  • Description updated (diff)
  • Status changed from New to Workable
Actions #6

Updated by dheidler 2 months ago

  • Status changed from Workable to In Progress
  • Assignee set to dheidler
Actions #7

Updated by dheidler 2 months ago

The reason why eth0 always wants to go back to the public zone:

openqaworker-arm21:~ # firewall-cmd --list-all-zones
…
trusted (active)
  target: ACCEPT
  icmp-block-inversion: no
  interfaces: br1 eth0 ovs-system tap0 tap1 …
…

openqaworker-arm21:~ # cat /etc/sysconfig/network/ifcfg-eth0
BOOTPROTO='dhcp'
STARTMODE='auto'
ZONE=public
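
The mismatch shown above (runtime zone trusted, configured zone public) can be spotted by extracting the runtime zone of an interface from the firewall-cmd --list-all-zones output. A minimal sketch, assuming the output layout shown above (zone header unindented, attributes indented); zone_of_interface is a hypothetical helper name:

```shell
#!/bin/sh
# Hypothetical helper: read `firewall-cmd --list-all-zones` output on stdin
# and print the name of the zone that lists the given interface.
zone_of_interface() {
    awk -v ifc="$1" '
        /^[^[:space:]]/ { zone = $1 }   # unindented line: zone header
        $1 == "interfaces:" {
            for (i = 2; i <= NF; i++)
                if ($i == ifc) { print zone; exit }
        }
    '
}

# Example use:
#   firewall-cmd --list-all-zones | zone_of_interface eth0
```

On a live system, firewall-cmd --get-zone-of-interface=eth0 answers the same question directly; the parser is mainly useful for captured output such as logs.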
Actions #8

Updated by dheidler 2 months ago

Updated /etc/sysconfig/network/ifcfg-eth0 to ZONE=trusted on both openqaworker-arm21 and 22.
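
For reference, the same change could be scripted roughly as follows. This is a sketch under assumptions, not the commands that were actually run: ifcfg_zone and set_ifcfg_zone are hypothetical helper names, and the zone name is assumed to contain no "/" characters.

```shell
#!/bin/sh
# Print the ZONE value from an ifcfg file, with optional quotes removed.
ifcfg_zone() {
    sed -n 's/^ZONE=//p' "$1" | tr -d "\"'"
}

# Set ZONE in an ifcfg file: replace an existing entry or append a new one.
# The file is rewritten in place so ownership and permissions are kept.
set_ifcfg_zone() {
    file=$1
    zone=$2
    tmp=$(mktemp) || return 1
    if grep -q '^ZONE=' "$file"; then
        sed "s/^ZONE=.*/ZONE=$zone/" "$file" > "$tmp"
    else
        { cat "$file"; printf 'ZONE=%s\n' "$zone"; } > "$tmp"
    fi
    cat "$tmp" > "$file" && rm -f "$tmp"
}

# Example use (as root, on each affected worker):
#   set_ifcfg_zone /etc/sysconfig/network/ifcfg-eth0 trusted
#   ifcfg_zone /etc/sysconfig/network/ifcfg-eth0
```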

Actions #9

Updated by dheidler 2 months ago

  • Status changed from In Progress to Feedback

nmcli network connectivity check always returns exit code 0, even when it prints limited.

https://github.com/os-autoinst/os-autoinst-distri-opensuse/pull/18726 should fix that.
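
Because the exit code carries no information here, a check has to compare the printed state instead. The following is a minimal sketch of that idea and not necessarily what the pull request does; connectivity_is_full is a hypothetical helper name.

```shell
#!/bin/sh
# Hypothetical helper: judge connectivity by the printed state, because the
# exit status is 0 even for "limited" (as observed above).
connectivity_is_full() {
    [ "$1" = "full" ]
}

# Example use:
#   state=$(nmcli networking connectivity check)
#   connectivity_is_full "$state" || { echo "connectivity: $state" >&2; exit 1; }
```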

Actions #10

Updated by dheidler 2 months ago

  • Status changed from Feedback to Resolved
Actions #11

Updated by jbaier_cz about 1 month ago

  • Related to action #157414: Network broken with multimachine on multiple workers (broken packet forwarding / NAT) size:M added