action #159558

closed

network unreachable on aarch64-o3

Added by ggardet_arm 8 months ago. Updated 8 months ago.

Status: Resolved
Priority: High
Assignee: mkittler
Category: Bugs in existing tests
Start date: 2024-04-24
Due date:
% Done: 0%
Estimated time:
Difficulty:

Description

Observation

openQA test in scenario microos-Tumbleweed-MicroOS-Image-ContainerHost-aarch64-container-host2microosnext@aarch64 fails in zypper_ref

network is unreachable on aarch64-o3

Test suite description

Boot from the latest published MicroOS ContainerHost image and transactional-update dup to snapshot under test. Make sure to use %BUILD% in the URL and file name to force a redownload for new builds.
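
For illustration only (the setting names and file name below are hypothetical, not taken from the actual test suite), using %BUILD% in the asset settings could look roughly like this, so that openQA fetches a fresh image whenever the build number changes:

    # hypothetical openQA test suite settings; file name and mirror path are illustrative
    HDD_1=MicroOS-ContainerHost-aarch64-%BUILD%.qcow2
    HDD_1_URL=http://download.opensuse.org/<path-to-image>/MicroOS-ContainerHost-aarch64-%BUILD%.qcow2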

Reproducible

Fails since (at least) Build 20240421

Expected result

Last good: 20240418 (or more recent)

Further details

Always latest result in this scenario: latest


Related issues 2 (0 open, 2 closed)

Related to openQA Infrastructure (public) - action #150869: Ensure multi-machine tests work on aarch64-o3 (or another but single machine only) size:M - Resolved - mkittler

Related to openQA Infrastructure (public) - action #133358: Migration of o3 VM to PRG2 - Ensure IPv6 is fully working - Resolved - okurz

Actions #1

Updated by okurz 8 months ago

  • Related to action #150869: Ensure multi-machine tests work on aarch64-o3 (or another but single machine only) size:M added
Actions #2

Updated by okurz 8 months ago

  • Tags set to infra, reactive work
  • Assignee set to mkittler
  • Priority changed from Normal to Urgent
  • Target version set to Ready
Actions #3

Updated by mkittler 8 months ago · Edited

Looks like this can be reproduced even outside of a VM via e.g. curl 'http://openqa.opensuse.org/assets/repo/openSUSE-Tumbleweed-oss-aarch64-Snapshot20240423/repodata/repomd.xml'. But e.g. curl 'http://www.google.de' works, so this is probably specific to reaching o3.

It works with HTTPS, though. Not sure why only HTTP ceased to work after running the MM setup script.
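
For reference, a minimal sketch of these checks (same asset URL as in the curl call above, with HTTPS and an unrelated host for comparison):

    # plain HTTP to o3 - failed at the time
    curl -sSf -o /dev/null 'http://openqa.opensuse.org/assets/repo/openSUSE-Tumbleweed-oss-aarch64-Snapshot20240423/repodata/repomd.xml' && echo 'o3 http ok'
    # HTTPS to o3 - worked
    curl -sSf -o /dev/null 'https://openqa.opensuse.org/assets/repo/openSUSE-Tumbleweed-oss-aarch64-Snapshot20240423/repodata/repomd.xml' && echo 'o3 https ok'
    # plain HTTP to an unrelated host - worked, so the problem is specific to reaching o3
    curl -sSf -o /dev/null 'http://www.google.de' && echo 'external http ok'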

Not sure how this worked before. For now I disabled the worker slots on aarch64-o3 by assigning a different worker class in /etc/openqa/workers.ini.
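
A minimal sketch of what that workers.ini change amounts to (the placeholder class name is made up for illustration):

    # /etc/openqa/workers.ini (excerpt)
    [global]
    # temporarily assign a class that no job requests so the scheduler
    # stops sending jobs to these slots
    WORKER_CLASS = qemu_aarch64_disabled

The worker services have to be restarted to pick up the changed configuration.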


I'm wondering whether this scenario has ever worked on aarch64-o3 since it was moved to the FC basement (because I couldn't find a passing job in that scenario that actually ran on aarch64-o3). And yes, the migration to nginx probably made things worse as well.
And considering jobs like https://openqa.opensuse.org/tests/4103335#step/gnome_window_switcher/9 the network on aarch64-o3 is definitely not completely broken (also not inside VMs).
The output of this network connectivity check also looks good - and the test only fails later when trying to access o3: https://openqa.opensuse.org/tests/4104487#step/hostname/25

So I think the problem is not that the entire network is unreachable on aarch64-o3 but only that HTTP traffic to o3 doesn't work.

But it might still be related to the MM setup. The most recent job that still worked (and relied on repo refreshing) is from 5 days ago (before my changes): https://openqa.opensuse.org/tests/4094290#step/zypper_ref/18
It looks like there are jobs that ran after the MM setup but that could refresh repositories just fine, e.g. https://openqa.opensuse.org/tests/4102678#step/zypper_ref/15 (and it really used o3 and plain http, see https://openqa.opensuse.org/tests/4102678#step/zypper_ar/6).

Note that I also cannot reach o3 via HTTP from my laptop (in the VPN) or from backup-qam.qe.nue2.suse.org (which is in the neighboring rack). Only HTTPS works. I could reach o3 via HTTP only from workers within the o3 network (e.g. arm1).

Looks like the NGINX config was modified around the time the issues started to come up:

Apr 24 09:33 /etc/nginx/vhosts.d/openqa.conf

Considering the login record "okurz pts/0 2a07:de40:b2bf:2 Wed Apr 24 09:33 still logged in", it was maybe okurz who changed it. EDIT: It was this change: #133358#note-14
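
For reference, both observations (the config's modification time and the active login session) can be gathered with standard tooling along these lines:

    # when was the vhost configuration last modified?
    ls -l --time-style=long-iso /etc/nginx/vhosts.d/openqa.conf
    # who is currently logged in, and what do recent logins look like?
    who
    last -F | head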

Actions #4

Updated by okurz 8 months ago

  • Related to action #133358: Migration of o3 VM to PRG2 - Ensure IPv6 is fully working added
Actions #5

Updated by mkittler 8 months ago

So the cause is probably the firewall (https://sd.suse.com/servicedesk/customer/portal/1/SD-128488) and not the MM setup. I am keeping the worker slots disabled for now anyway.

Actions #6

Updated by mkittler 8 months ago

  • Status changed from New to Feedback

tcp/80 allowed again on both IP stacks

It does in fact work again (just tested via curl), so I enabled the production worker classes again.
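
A quick verification sketch for "both IP stacks", forcing IPv4 and IPv6 separately against the same asset URL as before:

    curl -4 -sSf -o /dev/null 'http://openqa.opensuse.org/assets/repo/openSUSE-Tumbleweed-oss-aarch64-Snapshot20240423/repodata/repomd.xml' && echo 'IPv4 http ok'
    curl -6 -sSf -o /dev/null 'http://openqa.opensuse.org/assets/repo/openSUSE-Tumbleweed-oss-aarch64-Snapshot20240423/repodata/repomd.xml' && echo 'IPv6 http ok'

Re-enabling presumably just means restoring the production WORKER_CLASS values in /etc/openqa/workers.ini and restarting the worker services.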

Actions #7

Updated by livdywan 8 months ago

  • Priority changed from Urgent to High

mkittler wrote in #note-6:

tcp/80 allowed again on both IP stacks

It does in fact work again (just tested via curl), so I enabled the production worker classes again.

I assume we consider it less urgent.

Actions #8

Updated by mkittler 8 months ago

Production jobs seem to work again, e.g. https://openqa.opensuse.org/tests/4106603#step/selinux_smoke/7. So I'm considering this resolved.

Actions #9

Updated by mkittler 8 months ago

  • Status changed from Feedback to Resolved