action #150911

closed

remote_{vnc,ssh}_controller: unable to refresh repo download.o.o

Added by dimstar 4 months ago. Updated 3 months ago.

Status: Resolved
Priority: Normal
Assignee:
Category: Bugs in existing tests
Target version:
Start date: 2023-11-15
Due date:
% Done: 0%
Estimated time:
Difficulty:

Description

Observation

openQA test in scenario opensuse-Tumbleweed-DVD-x86_64-remote_vnc_controller@64bit fails in await_install

The test tries endlessly to refresh the download.o.o/update/tumbleweed repo, which is constantly reported as not accessible.
The investigation jobs reported in e.g. https://openqa.opensuse.org/tests/3724384#comments show the same error
on retries with old products and old tests, indicating that we seem to have some underlying infra issue.
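
To tell an infra outage apart from a test regression, a bounded reachability check can help. This is a minimal sketch, not what the test itself does: `retry` is a hypothetical helper, and the `repodata/repomd.xml` path in the commented usage is an assumption based on the usual rpm-md repository layout.

```shell
#!/bin/sh
# Hypothetical helper (not part of openQA or the test code): run a command
# up to $1 times, sleeping $2 seconds between attempts.
retry() {
    attempts=$1
    delay=$2
    shift 2
    i=1
    while [ "$i" -le "$attempts" ]; do
        if "$@"; then
            return 0
        fi
        echo "attempt $i/$attempts failed: $*" >&2
        i=$((i + 1))
        # No sleep after the final failed attempt.
        if [ "$i" -le "$attempts" ]; then
            sleep "$delay"
        fi
    done
    return 1
}

# Example usage (commented out; needs network access and curl).
# The repomd.xml path is an assumption, see the note above.
# retry 3 10 curl --silent --fail --head --output /dev/null \
#     https://download.opensuse.org/update/tumbleweed/repodata/repomd.xml
```

If the check still fails after the last attempt, from a worker as well as from an unrelated machine, that points at the download infrastructure rather than at the test.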

Test suite description

Controller performs remote installation via vnc to the target

Reproducible

Fails since (at least) Build 20231114 (current job)

Expected result

Last good: 20231110 (or more recent)

Further details

Always latest result in this scenario: latest


Related issues 1 (0 open, 1 closed)

Is duplicate of openQA Infrastructure - action #150920: openqaworker-arm22 is unable to join download.opensuse.org in parallel tests = tap mode size:M | Resolved | nicksinger | 2023-11-15

Actions #1

Updated by okurz 4 months ago

  • Tags set to reactive work, infra
  • Due date set to 2023-11-29
  • Status changed from New to Feedback
  • Assignee set to okurz
  • Priority changed from Normal to High
  • Target version set to Ready

https://suse.slack.com/archives/C04MDKHQE20/p1699958792983489

(Marcus Rueckert) We are migrating download.opensuse.org and related services now.

awaiting end of migration and then monitoring

Actions #2

Updated by dimstar 4 months ago

okurz wrote in #note-1:

https://suse.slack.com/archives/C04MDKHQE20/p1699958792983489

(Marcus Rueckert) We are migrating download.opensuse.org and related services now.

awaiting end of migration and then monitoring

According to https://suse.slack.com/archives/C04MDKHQE20/p1699968217497509 migration was completed yesterday

Actions #3

Updated by okurz 4 months ago

Yes, but we observed multiple problems with download.opensuse.org today, and https://status.opensuse.org/ reports at least a partial outage of related monitoring services, so I don't trust the stability :)

Actions #4

Updated by livdywan 4 months ago

  • Is duplicate of action #150920: openqaworker-arm22 is unable to join download.opensuse.org in parallel tests = tap mode size:M added
Actions #5

Updated by okurz 4 months ago

  • Due date deleted (2023-11-29)
  • Status changed from Feedback to Blocked
Actions #6

Updated by okurz 3 months ago

reminded in the blocker #134123-34

Actions #7

Updated by okurz 3 months ago

  • Priority changed from High to Normal

#134123 is already high, following here with lower prio

Actions #8

Updated by okurz 3 months ago

  • Status changed from Blocked to Resolved

we are trying to debug openQA multi-machine tests on openqaworker-arm21+arm22 and struggle to come up with a manual qemu command line that makes the machine boot. I tried e.g.

/usr/bin/qemu-system-aarch64 -device virtio-gpu-pci,edid=on,xres=1024,yres=768 -m 2048 -machine virt,gic-version=host -cpu host -mem-prealloc -mem-path /dev/hugepages/ -netdev user,id=qanet0 -device virtio-net,netdev=qanet0,mac=52:54:00:19:34:56 -enable-kvm -no-shutdown -vnc :91,share=force-shared

but when connecting over VNC I only see that the guest has not initialized the display yet. Any ideas what's missing?

openqa-clone-job --parental-inheritance --within-instance https://openqa.opensuse.org/tests/3815206 PAUSE_AT=rescuesystem_validate_131 {BUILD,TEST}+=-poo150920 _GROUP=0

rescue_system@aarch64 -> https://openqa.opensuse.org/tests/3816230

and to crosscheck with warm21

openqa-clone-job --parental-inheritance --within-instance https://openqa.opensuse.org/tests/3815206 PAUSE_AT=rescuesystem_validate_131 {BUILD,TEST}+=-poo150920 _GROUP=0 WORKER_CLASS=openqaworker-arm21
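
For readers unfamiliar with the `{BUILD,TEST}+=-poo150920` argument in the clone commands above: the shell expands it into two separate arguments before openqa-clone-job sees them. A small sketch to show the expansion; `show_args` is a hypothetical name, and the reading that `+=` appends the suffix to the existing job setting is my understanding of openqa-clone-job, not something stated in this ticket.

```shell
# Brace expansion is a bash feature (not POSIX sh), so bash is invoked
# explicitly. Prints each resulting argument on its own line.
show_args() {
    bash -c 'printf "%s\n" {BUILD,TEST}+=-poo150920'
}

show_args
# BUILD+=-poo150920
# TEST+=-poo150920
```

Tagging both BUILD and TEST with the ticket number keeps the debug clones easy to find and to tell apart from production jobs.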

Both tests start up just fine, but for both of them the developer mode cannot be connected over the webUI. Crosschecked with another machine: https://openqa.opensuse.org/tests/3816232#live for openqaworker24 is just fine. Also, both tests are not related to multi-machine and don't configure the network in the rescue mode, so they do not help with debugging the network within the hosts; the question is rather why the hosts do not allow reaching the developer mode. Well, in the end mkittler fixed it. Likely wrong combinations of os-autoinst-setup-multi-machine and parameters were triggered.

So let's try developer mode and the only real multi-machine scenario on aarch64

openqa-clone-job --parental-inheritance --within-instance https://openqa.opensuse.org/tests/3815268 PAUSE_AT=libzypp_config {BUILD,TEST}+=-poo150920 _GROUP=0 WORKER_CLASS=openqaworker-arm21

-> https://openqa.opensuse.org/tests/3816247

but I didn't follow up on the same day so eventually the job ran into timeout as expected.

In the end mkittler fixed the config at least on warm21 with

firewall-cmd --zone public --remove-interface=eth0
firewall-cmd --zone trusted --add-interface=eth0

I assume that nicksinger executed os-autoinst-setup-multi-machine but did not see (or ignored) error messages about the inability to add eth0 to trusted as it was already in public according to #150920-22?
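
To make that kind of conflict visible up front, the zone move could be planned as a dry run before touching the firewall. A minimal sketch under assumptions: `plan_zone_move` is a hypothetical helper that only prints commands, and the commented query relies on `firewall-cmd --get-zone-of-interface`, which was not used in this ticket.

```shell
# Hypothetical dry-run helper: given the zone an interface is currently in
# and the zone it should be in, print the firewall-cmd calls to run.
plan_zone_move() {
    iface=$1
    current=$2
    wanted=$3
    if [ "$current" = "$wanted" ]; then
        echo "# $iface already in zone $wanted, nothing to do"
        return 0
    fi
    # An interface bound to another zone has to be removed from it first.
    if [ -n "$current" ]; then
        echo "firewall-cmd --zone $current --remove-interface=$iface"
    fi
    echo "firewall-cmd --zone $wanted --add-interface=$iface"
}

# On a real host the current zone could be queried first (assumption):
#   current=$(firewall-cmd --get-zone-of-interface=eth0)
plan_zone_move eth0 public trusted
```

For eth0 in public this prints the same two commands that were applied on warm21.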

Anyway, the developer mode now works fine and, I assume, so do multi-machine tests on warm21, so I added "tap" back to warm21:/etc/openqa/workers.ini and am monitoring jobs.

Due to the new worker class there was a sudden rise in jobs, some of which ended up incomplete with "Reason: cache failure: Cache service queue already full (5)", which is unfortunate. I now increased the cache size, which is at least slightly related and might help. On both warm21+warm22 we have a 6TB NVMe from which we run most mount points including /var/lib/openqa/cache with enough free space so I set

# Limit size of CACHEDIRECTORY to the specified value in GiB (50 GiB by default)
CACHELIMIT = 1000
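
To crosscheck the configured limit against the space actually available, something like the following could be used. It is only a sketch: `cache_limit_gib` is a hypothetical helper, and the fallback of 50 is taken from the config comment quoted above.

```shell
# Print the CACHELIMIT value (GiB) from a workers.ini-style file, or 50
# (the default quoted in the config comment) when the key is absent.
# Commented-out lines such as "# CACHELIMIT = 50" are ignored.
cache_limit_gib() {
    value=$(sed -n 's/^[[:space:]]*CACHELIMIT[[:space:]]*=[[:space:]]*\([0-9][0-9]*\).*$/\1/p' "$1" | tail -n 1)
    echo "${value:-50}"
}

# Example usage on a worker host (commented out; paths as in this ticket):
# limit=$(cache_limit_gib /etc/openqa/workers.ini)
# size=$(df -Pk /var/lib/openqa/cache | awk 'NR==2 {print int($2/1024/1024)}')
# echo "CACHELIMIT=${limit} GiB, filesystem size=${size} GiB"
```

With a 6TB device and CACHELIMIT = 1000, the limit should stay well below the filesystem size.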

Multiple multi-machine tests now passed including

so I assume it's ok if we keep tap. Now to warm22.

openqa-clone-job --parental-inheritance --within-instance https://openqa.opensuse.org/tests/3815268 {BUILD,TEST}+=-poo150920 _GROUP=0 WORKER_CLASS=openqaworker-arm22
openqa-clone-job --skip-chained-deps --parental-inheritance --within-instance https://openqa.opensuse.org/tests/3818743 {BUILD,TEST}+=-poo150920 _GROUP=0 WORKER_CLASS=openqaworker-arm22

yep, also fine.

So let's check across both warm21+warm22:

openqa-clone-job --skip-chained-deps --parental-inheritance --within-instance https://openqa.opensuse.org/tests/3818743 {BUILD,TEST}+=-poo150920 _GROUP=0 WORKER_CLASS:ovs-server=openqaworker-arm22 WORKER_CLASS:ovs-client=openqaworker-arm21

Both were successful, so I enabled "tap" for production on openqaworker-arm22 again as well. With that we can resolve here.
