action #150920

closed

openqaworker-arm22 is unable to join download.opensuse.org in parallel tests = tap mode size:M

Added by ggardet_arm 6 months ago. Updated 5 months ago.

Status:
Resolved
Priority:
High
Assignee:
Category:
-
Target version:
Start date:
2023-11-15
Due date:
% Done:

0%

Estimated time:
Tags:

Description

Observation

openQA test in scenario microos-Tumbleweed-DVD-aarch64-remote_ssh_controller@aarch64 fails in
await_install

Test suite description

Maintainer: jrivera Install remote server (parallel job) with ssh.

Reproducible

Fails since (at least) Build 20231114 (current job)

Expected result

Last good: 20231109 (or more recent)

Acceptance Criteria (should be implicit)

  • AC1: above scenario passes consistently again

Suggestions

  • Confirm that the test is working correctly i.e. it's not a test issue, not an issue with worker arm22
  • Check the parallel_failed case and not just the dependent job
  • Look further into what the scenario is testing
  • Check whether other MM scenarios work (that also rely on external network connectivity)

Further details

Always latest result in this scenario: latest


Related issues 7 (2 open, 5 closed)

  • Related to openQA Infrastructure - action #150845: openqaworker-arm22 broken due to packages automatically removed size:M (Resolved, mkittler, 2023-11-14, 2023-11-29)
  • Related to openQA Infrastructure - action #151231: package loss between o3 machines and download.opensuse.org size:M (Resolved, nicksinger, 2023-11-21)
  • Related to openQA Infrastructure - action #134123: Setup new PRG2 openQA worker for o3 - two new arm workers size:M (Resolved, nicksinger)
  • Related to openQA Project - action #155278: o3 aarch64 multi-machine tests on openqaworker-arm21 and 22 fail to resolve codecs.opensuse.org size:M (Resolved, dheidler, 2024-02-09)
  • Related to openQA Tests - action #157414: Network broken with multimachine on multiple workers (broken packet forwarding / NAT) size:M (Feedback, mkittler, 2024-03-18, 2024-05-07)
  • Has duplicate openQA Tests - action #150911: remote_{vnc,ssh}_controller: unable to refresh repo download.o.o (Resolved, okurz, 2023-11-15)
  • Copied to openQA Project - action #150944: Consider showing the investigation tab for "parallel_failed" as well (New, 2023-11-15)

Actions #1

Updated by okurz 6 months ago

  • Tags set to infra, o3
  • Priority changed from Normal to High
  • Target version set to Ready
Actions #2

Updated by okurz 6 months ago

  • Copied to action #150944: Consider showing the investigation tab for "parallel_failed" as well added
Actions #3

Updated by okurz 6 months ago

  • Subject changed from openqaworker-arm22 is unable to join download.opensuse.org in parallel tests = tap mode to openqaworker-arm22 is unable to join download.opensuse.org in parallel tests = tap mode size:M
  • Description updated (diff)
  • Status changed from New to Workable
Actions #4

Updated by livdywan 6 months ago

  • Has duplicate action #150911: remote_{vnc,ssh}_controller: unable to refresh repo download.o.o added
Actions #5

Updated by okurz 6 months ago

  • Related to action #150845: openqaworker-arm22 broken due to packages automatically removed size:M added
Actions #6

Updated by okurz 6 months ago

  • Status changed from Workable to Blocked
  • Assignee set to okurz
Actions #7

Updated by okurz 6 months ago

  • Status changed from Blocked to In Progress

#150845 resolved, maybe the proper upgrade helps. triggered https://openqa.opensuse.org/tests/3733404#live

Actions #8

Updated by okurz 6 months ago

  • Tags changed from infra, o3 to o3
  • Status changed from In Progress to Workable
  • Assignee deleted (okurz)

https://openqa.opensuse.org/tests/3733404#step/await_install/148 reproduces the same problem so we can continue investigation there.

Actions #9

Updated by osukup 5 months ago · Edited

It looks more like a general issue: x86_64 shows the same problem -> https://openqa.opensuse.org/tests/3742085#step/await_install/75, so is it a product or an infra problem? Looking at the history, it started with build 20231113, but at roughly the same time.

Actions #10

Updated by nicksinger 5 months ago

  • Related to action #151231: package loss between o3 machines and download.opensuse.org size:M added
Actions #11

Updated by okurz 5 months ago

  • Status changed from Workable to Blocked
  • Assignee set to okurz

yes, likely "infra issue". Blocking on #151231

Actions #12

Updated by livdywan 5 months ago

okurz wrote in #note-11:

yes, likely "infra issue". Blocking on #151231

Effectively blocking on #151231 - I asked to clarify what we're waiting on there

Actions #13

Updated by okurz 5 months ago

#151231 resolved. Waiting for #134123 to crosscheck

Actions #14

Updated by ggardet_arm 5 months ago

Now, the problems are wider:

This breaks more parallel tests. For example ovs-client/ovs-server parallel test succeeded yesterday, but broke today: https://openqa.opensuse.org/tests/3764570#next_previous

Actions #15

Updated by okurz 5 months ago

  • Related to action #134123: Setup new PRG2 openQA worker for o3 - two new arm workers size:M added
Actions #16

Updated by okurz 5 months ago

I bumped prio of blocking #134123 now

Actions #17

Updated by okurz 5 months ago

  • Status changed from Blocked to Workable
  • Assignee deleted (okurz)

#134123 showed no more problems but https://openqa.opensuse.org/tests/3791292#step/ovs_client/32 still fails so we seem to have an unrelated problem here.

Actions #18

Updated by mkittler 5 months ago · Edited

  • Status changed from Workable to In Progress
  • Assignee set to mkittler

I retried the job (https://openqa.opensuse.org/tests/3791292#step/setup_multimachine/64) to see whether it wasn't just a random networking issue as the SUT has an IP and the correct MTU value.

I'll check for other possible mistakes in the MM setup on arm22.

EDIT: It failed again (https://openqa.opensuse.org/tests/3792767). It looks like no GRE tunnels are set up. That might be the problem, but on the other hand there is no other aarch64 MM worker, so this should be OK. It also looks like the firewall is blocking the port for the developer mode. Running the MM setup script should fix all of this, so I'll try that.

Actions #19

Updated by openqa_review 5 months ago

  • Due date set to 2023-12-23

Setting due date based on mean cycle time of SUSE QE Tools

Actions #20

Updated by nicksinger 5 months ago · Edited

  • Assignee changed from mkittler to nicksinger

as discussed in the daily I will continue here to cover/crosscheck AC1 of #134123

Actions #21

Updated by livdywan 5 months ago

  • Tags changed from o3 to o3, infra
Actions #22

Updated by nicksinger 5 months ago

I executed the script to setup multi-machine on arm21 as well as arm22. On 21 there was some previous br1 config which the script couldn't handle which I solved by removing /etc/sysconfig/network/ifcfg-br1. I later realized that $1 is used in the script to generate the gre_tunnel_preup script so I finally used this command:

instance=32 ethernet=eth0 os-autoinst-setup-multi-machine /etc/wicked/scripts/gre_tunnel_preup.sh

I also adjusted the GRE script to manually add a mutual connection between both workers. After a reboot I tested 2 MM jobs, but both failed to connect to external hosts because of DNS errors, so I now have to figure out:

a) whether the setup inside the VMs is broken (I can use @mkittler's commands to spawn a VM to investigate further)
b) how to actually fix the setup and what, if anything, is missing in our setup script
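
To tell DNS errors apart from broken routing or forwarding when checking from inside a VM, a small triage helper along these lines might be useful. This is only a sketch: check_host and the RESOLVER/FETCHER override variables are hypothetical names made up for illustration and are not part of openQA or the setup script.

```shell
#!/bin/sh
# Hypothetical helper: classify why an external host cannot be reached.
# RESOLVER and FETCHER are illustrative override hooks (assumptions),
# defaulting to getent and curl.
check_host() {
    host="$1"
    resolver="${RESOLVER:-getent hosts}"
    fetcher="${FETCHER:-curl --silent --head --max-time 10 --output /dev/null}"
    # Step 1: name resolution. A failure here points at DNS.
    if ! $resolver "$host" >/dev/null 2>&1; then
        echo "dns-failure"
        return 1
    fi
    # Step 2: HTTP reachability. A failure here points at routing/NAT/firewall.
    if ! $fetcher "https://$host/" >/dev/null 2>&1; then
        echo "http-failure"
        return 2
    fi
    echo "ok"
}

# Example call inside the SUT or a manually started VM:
#   check_host download.opensuse.org
```

If this prints dns-failure, the resolver configuration handed to the VM would be the first thing to look at; http-failure would rather suggest packet forwarding or NAT on the worker.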

Actions #24

Updated by nicksinger 5 months ago

got a VM booting and some graphical output via VNC (needs to be tunneled via ssh because of firewalld). The command I used is:

qemu-system-aarch64 -m 2048 -enable-kvm -netdev tap,id=qanet0,ifname=tap10,script=no,downscript=no -device virtio-net,netdev=qanet0,mac=52:54:00:13:0b:4a -vnc :41,share=force-shared -cpu host -smp cores=4,threads=4 -monitor stdio -machine virt,gic-version=host -bios /usr/share/qemu/qemu-uefi-aarch64.bin -device virtio-gpu-pci,edid=on,xres=1024,yres=768 openSUSE-MicroOS.aarch64-kvm-and-xen.qcow2

Compared to https://open.qa/docs/#_start_test_vms_manually I had to find a couple of additional options to make it work on aarch64. So far I cannot type anything via VNC; I need to figure out how to fix that.
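
For the ssh tunnel mentioned above: a VNC display :N listens on TCP port 5900+N, so -vnc :41 ends up on port 5941. A minimal sketch for building the matching ssh port forward could look like this. The function names and the worker hostname in the example call are assumptions, since the comment does not say which host the VM ran on.

```shell
#!/bin/sh
# VNC display :N corresponds to TCP port 5900 + N.
vnc_port() {
    echo $((5900 + $1))
}

# Print (do not run) an ssh command that forwards the VNC port of a
# remote worker to the same port on localhost.
# Hypothetical helper; the hostname passed in is an assumption.
vnc_tunnel_cmd() {
    display="$1"
    worker="$2"
    port="$(vnc_port "$display")"
    echo "ssh -N -L ${port}:localhost:${port} ${worker}"
}

# Example for the "-vnc :41" display used above:
vnc_tunnel_cmd 41 openqaworker-arm22
```

With the tunnel running, a VNC client can connect to localhost:5941 without opening any port in firewalld.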

Actions #25

Updated by okurz 5 months ago

  • Due date deleted (2023-12-23)
  • Status changed from In Progress to Resolved

we are trying to debug openQA multi-machine tests on openqaworker-arm21+arm22 and struggle to come up with a manual qemu command line that makes the machine boot. I tried e.g.

/usr/bin/qemu-system-aarch64 -device virtio-gpu-pci,edid=on,xres=1024,yres=768 -m 2048 -machine virt,gic-version=host -cpu host -mem-prealloc -mem-path /dev/hugepages/ -netdev user,id=qanet0 -device virtio-net,netdev=qanet0,mac=52:54:00:19:34:56 -enable-kvm -no-shutdown -vnc :91,share=force-shared

but when connecting over VNC I only see that the guest has not initialized the display yet. Any ideas what's missing?

openqa-clone-job --parental-inheritance --within-instance https://openqa.opensuse.org/tests/3815206 PAUSE_AT=rescuesystem_validate_131 {BUILD,TEST}+=-poo150920 _GROUP=0

rescue_system@aarch64 -> https://openqa.opensuse.org/tests/3816230

and to crosscheck with warm21

openqa-clone-job --parental-inheritance --within-instance https://openqa.opensuse.org/tests/3815206 PAUSE_AT=rescuesystem_validate_131 {BUILD,TEST}+=-poo150920 _GROUP=0 WORKER_CLASS=openqaworker-arm21

Both tests start up just fine, but the developer mode cannot be reached over the webUI for either of them. I crosschecked with another machine: https://openqa.opensuse.org/tests/3816232#live on openqaworker24 is just fine. Also, neither test is related to multi-machine, and neither configures networking in the rescue mode, so they do not help with debugging the network within the hosts. The open question is why the hosts do not allow reaching the developer mode. Well, in the end mkittler fixed it. Likely wrong combinations of os-autoinst-setup-multi-machine and parameters were triggered.

So let's try developer mode and the only real multi-machine scenario on aarch64

openqa-clone-job --parental-inheritance --within-instance https://openqa.opensuse.org/tests/3815268 PAUSE_AT=libzypp_config {BUILD,TEST}+=-poo150920 _GROUP=0 WORKER_CLASS=openqaworker-arm21

-> https://openqa.opensuse.org/tests/3816247

but I didn't follow up on the same day so eventually the job ran into timeout as expected.

In the end mkittler fixed the config at least on warm21 with

firewall-cmd --zone public --remove-interface=eth0
firewall-cmd --zone trusted --add-interface=eth0

I assume that nicksinger executed os-autoinst-setup-multi-machine but did not see (or ignored) error messages about the inability to add eth0 to trusted, as it was already in public according to #150920-22?
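
The zone change above could be made safe to repeat with a check of the current zone first. The following is only a sketch and not what os-autoinst-setup-multi-machine does: ensure_trusted and the FWCMD override are hypothetical names, while the firewall-cmd options used are the standard ones. Note that these calls only change the runtime configuration unless --permanent is added.

```shell
#!/bin/sh
# Hypothetical helper: move an interface into the "trusted" zone,
# removing it from its current zone first if needed.
# FWCMD is an illustrative override hook (assumption) for dry runs.
ensure_trusted() {
    iface="$1"
    fw="${FWCMD:-firewall-cmd}"
    # Ask firewalld which zone the interface is bound to (empty if none).
    current="$($fw --get-zone-of-interface="$iface" 2>/dev/null)" || current=""
    if [ "$current" = "trusted" ]; then
        echo "$iface is already in zone trusted"
        return 0
    fi
    if [ -n "$current" ]; then
        # An interface can only be in one zone, so release it first.
        $fw --zone="$current" --remove-interface="$iface" || return 1
    fi
    $fw --zone=trusted --add-interface="$iface"
}

# Example call (needs root on the worker):
#   ensure_trusted eth0
```

Failing loudly when the removal does not work avoids the situation guessed at above, where an error about eth0 already being in public goes unnoticed.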

Anyway, the developer mode now works fine and, I assume, so do multi-machine tests on warm21, so I added "tap" back to warm21:/etc/openqa/workers.ini and I am monitoring jobs.

Due to the new worker class there was a sudden rise of jobs and also jobs ending up incomplete with "Reason: cache failure: Cache service queue already full (5)" which is unfortunate. I now increased the cache size which is at least slightly related and might help. On both warm21+warm22 we have a 6TB NVMe from which we run most mount points including /var/lib/openqa/cache with enough free space so I set

# Limit size of CACHEDIRECTORY to the specified value in GiB (50 GiB by default)
CACHELIMIT = 1000
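
Before raising CACHELIMIT it may be worth checking that the value fits on the filesystem holding the cache directory. A rough sketch, with hypothetical function names; it compares against the currently available space only, so it is conservative because it ignores space the cache already occupies:

```shell
#!/bin/sh
# Succeed if a cache limit given in GiB fits into the available
# space given in KiB (1 GiB = 1024 * 1024 KiB).
limit_fits() {
    limit_gib="$1"
    avail_kib="$2"
    [ $((limit_gib * 1024 * 1024)) -le "$avail_kib" ]
}

# Available space in KiB on the filesystem containing the given path,
# taken from the fourth column of POSIX-format df output.
avail_kib_of() {
    df -Pk "$1" | awk 'NR == 2 { print $4 }'
}

# Example call on a worker:
#   limit_fits 1000 "$(avail_kib_of /var/lib/openqa/cache)" && echo "limit fits"
```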

Multiple multi-machine tests now passed including

so I assume it's ok if we keep tap. Now to warm22.

openqa-clone-job --parental-inheritance --within-instance https://openqa.opensuse.org/tests/3815268 {BUILD,TEST}+=-poo150920 _GROUP=0 WORKER_CLASS=openqaworker-arm22
openqa-clone-job --skip-chained-deps --parental-inheritance --within-instance https://openqa.opensuse.org/tests/3818743 {BUILD,TEST}+=-poo150920 _GROUP=0 WORKER_CLASS=openqaworker-arm22

yep, also fine.

So let's check across both warm21+warm22:

openqa-clone-job --skip-chained-deps --parental-inheritance --within-instance https://openqa.opensuse.org/tests/3818743 {BUILD,TEST}+=-poo150920 _GROUP=0 WORKER_CLASS:ovs-server=openqaworker-arm22 WORKER_CLASS:ovs-client=openqaworker-arm21

Both were successful, so I enabled "tap" for production on openqaworker-arm22 again as well. With that we can resolve this ticket.

Actions #26

Updated by jbaier_cz 3 months ago

  • Related to action #155278: o3 aarch64 multi-machine tests on openqaworker-arm21 and 22 fail to resolve codecs.opensuse.org size:M added
Actions #27

Updated by okurz about 1 month ago

  • Related to action #157414: Network broken with multimachine on multiple workers (broken packet forwarding / NAT) size:M added