action #150911


remote_{vnc,ssh}_controller: unable to refresh repo download.o.o

Added by dimstar 4 months ago. Updated 3 months ago.

Bugs in existing tests
openQA test in scenario opensuse-Tumbleweed-DVD-x86_64-remote_vnc_controller@64bit fails in

The test tries endlessly to refresh the download.o.o/update/tumbleweed repo - constantly returning repo not accessible.
Looking at the investigation jobs reported in e.g this same error
shows up on retries with old products and old tests - indicating that we seem to have some underlying infra issue.
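
For checking this by hand, a bounded retry avoids the endless loop seen in the test. The following is only a minimal sketch: `retry_bounded` is a hypothetical helper name, not part of openQA or the test code, and the attempt count and delay are arbitrary.

```shell
# retry_bounded MAX_ATTEMPTS DELAY_SECONDS COMMAND [ARGS...]
# Hypothetical helper: runs COMMAND until it succeeds, but at most
# MAX_ATTEMPTS times, sleeping DELAY_SECONDS between attempts.
retry_bounded() {
    max=$1
    delay=$2
    shift 2
    attempt=1
    while :; do
        if "$@"; then
            return 0
        fi
        if [ "$attempt" -ge "$max" ]; then
            echo "giving up after $attempt attempts: $*" >&2
            return 1
        fi
        attempt=$((attempt + 1))
        sleep "$delay"
    done
}
```

For example, `retry_bounded 3 10 zypper --non-interactive refresh` would give up after three failed refresh attempts instead of looping forever.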

Test suite description

Controller performs remote installation via vnc to the target


Fails since (at least) Build 20231114 (current job)

Expected result

Last good: 20231110 (or more recent)

Further details

Always latest result in this scenario: latest

Updated by okurz 4 months ago

  • Tags set to reactive work, infra
  • Due date set to 2023-11-29
  • Status changed from New to Feedback
  • Assignee set to okurz
  • Priority changed from Normal to High
  • Target version set to Ready

(Marcus Rueckert) We are migrating and related services now.

awaiting end of migration and then monitoring

Updated by dimstar 4 months ago

okurz wrote in #note-1:

(Marcus Rueckert) We are migrating and related services now.

awaiting end of migration and then monitoring

According to migration was completed yesterday

Updated by okurz 4 months ago

Yes but we observed multiple problems with today. And claims at least partial outage of related monitoring services so I don't trust the stability :)

Updated by livdywan 4 months ago

  • Is duplicate of action #150920: openqaworker-arm22 is unable to join in parallel tests = tap mode size:M added

Updated by okurz 4 months ago

  • Due date deleted (2023-11-29)
  • Status changed from Feedback to Blocked

Updated by okurz 3 months ago

reminded in the blocker #134123-34

Updated by okurz 3 months ago

  • Priority changed from High to Normal

#134123 is already high, following here with lower prio

Updated by okurz 3 months ago

  • Status changed from Blocked to Resolved

We are trying to debug openQA multi-machine tests on openqaworker-arm21+arm22 and struggle to come up with a manual qemu command line that makes the machine boot. I tried e.g.

/usr/bin/qemu-system-aarch64 -device virtio-gpu-pci,edid=on,xres=1024,yres=768 -m 2048 -machine virt,gic-version=host -cpu host -mem-prealloc -mem-path /dev/hugepages/ -netdev user,id=qanet0 -device virtio-net,netdev=qanet0,mac=52:54:00:19:34:56 -enable-kvm -no-shutdown -vnc :91,share=force-shared

but when connecting over VNC I only see that the guest has not initialized the display yet. Any ideas what's missing?

openqa-clone-job --parental-inheritance --within-instance PAUSE_AT=rescuesystem_validate_131 {BUILD,TEST}+=-poo150920 _GROUP=0

rescue_system@aarch64 ->

and to crosscheck with warm21

openqa-clone-job --parental-inheritance --within-instance PAUSE_AT=rescuesystem_validate_131 {BUILD,TEST}+=-poo150920 _GROUP=0 WORKER_CLASS=openqaworker-arm21

Both tests start up just fine, but developer mode cannot be reached over the webUI for either of them. I crosschecked with another machine: on openqaworker24 it works just fine. Also, neither test is related to multi-machine and neither configures the network in rescue mode, so they do not help with debugging the network within the hosts. The question is rather why the hosts do not allow reaching developer mode. Well, in the end mkittler fixed it. Likely wrong combinations of os-autoinst-setup-multi-machine and parameters were triggered.
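
A note on the clone commands above: `{BUILD,TEST}+=-poo150920` relies on bash brace expansion, so openqa-clone-job receives two separate arguments. The `+=` form is understood here as openqa-clone-job's way of appending a suffix to an existing setting value; treat that as an assumption and check `openqa-clone-job --help`. A quick way to see the expansion, run through bash explicitly since plain POSIX sh does not expand braces:

```shell
# Brace expansion is a bash feature, so invoke bash explicitly.
# The single quotes keep the calling shell from touching the braces.
expanded=$(bash -c 'echo {BUILD,TEST}+=-poo150920')
echo "$expanded"
# prints: BUILD+=-poo150920 TEST+=-poo150920
```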

So let's try developer mode and the only real multi-machine scenario on aarch64

openqa-clone-job --parental-inheritance --within-instance PAUSE_AT=libzypp_config {BUILD,TEST}+=-poo150920 _GROUP=0 WORKER_CLASS=openqaworker-arm21


but I didn't follow up on the same day so eventually the job ran into timeout as expected.

In the end mkittler fixed the config at least on warm21 with

firewall-cmd --zone public --remove-interface=eth0
firewall-cmd --zone trusted --add-interface=eth0

I assume that nicksinger executed os-autoinst-setup-multi-machine but did not see (or ignored) the error messages about the inability to add eth0 to trusted, as it was already in public according to #150920-22?
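
To avoid that situation, the zone change can be guarded by first asking firewalld which zone currently holds the interface. This is a sketch only, not what os-autoinst-setup-multi-machine does: `move_interface_to_zone` and the `FIREWALL_CMD` override are hypothetical names introduced for illustration; `--get-zone-of-interface`, `--remove-interface` and `--add-interface` are regular firewall-cmd options.

```shell
# FIREWALL_CMD is a hypothetical override hook (e.g. for dry runs);
# it defaults to the real firewall-cmd binary.
FIREWALL_CMD=${FIREWALL_CMD:-firewall-cmd}

# move_interface_to_zone IFACE ZONE
# Hypothetical helper: removes IFACE from its current zone (if any)
# before adding it to ZONE, and does nothing if it is already there.
move_interface_to_zone() {
    iface=$1
    target=$2
    # A non-zero exit status means the interface is in no zone.
    current=$("$FIREWALL_CMD" --get-zone-of-interface="$iface" 2>/dev/null) || current=""
    if [ "$current" = "$target" ]; then
        echo "$iface already in zone $target"
        return 0
    fi
    if [ -n "$current" ]; then
        "$FIREWALL_CMD" --zone "$current" --remove-interface="$iface" || return 1
    fi
    "$FIREWALL_CMD" --zone "$target" --add-interface="$iface" || return 1
}
```

Calling `move_interface_to_zone eth0 trusted` would then issue the same two commands as above when eth0 is still in public, and do nothing when it is already in trusted.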

Anyway, developer mode now works fine and, I assume, so do multi-machine tests on warm21, so I added "tap" back to warm21:/etc/openqa/workers.ini and I am monitoring jobs.
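
To verify which workers actually advertise "tap", the WORKER_CLASS lines in workers.ini can be checked with a small helper. A sketch, assuming the usual `WORKER_CLASS = a,b,c` comma-separated layout; `has_worker_class` is a hypothetical helper name.

```shell
# has_worker_class FILE CLASS
# Hypothetical helper: succeeds if any uncommented WORKER_CLASS line in
# FILE lists CLASS as an exact comma-separated entry.
has_worker_class() {
    file=$1
    class=$2
    grep -E '^[[:space:]]*WORKER_CLASS[[:space:]]*=' "$file" \
        | sed -e 's/^[^=]*=//' -e 's/[[:space:]]//g' \
        | tr ',' '\n' \
        | grep -qx -- "$class"
}
```

For example, `has_worker_class /etc/openqa/workers.ini tap && echo "tap enabled"` prints the message only when "tap" is listed as its own entry, not merely as part of another class name.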

Due to the new worker class there was a sudden rise of jobs and also jobs ending up incomplete with "Reason: cache failure: Cache service queue already full (5)" which is unfortunate. I now increased the cache size which is at least slightly related and might help. On both warm21+warm22 we have a 6TB NVMe from which we run most mount points including /var/lib/openqa/cache with enough free space so I set

# Limit size of CACHEDIRECTORY to the specified value in GiB (50 GiB by default)

Multiple multi-machine tests now passed including

so I assume it's ok if we keep tap. Now to warm22.

openqa-clone-job --parental-inheritance --within-instance {BUILD,TEST}+=-poo150920 _GROUP=0 WORKER_CLASS=openqaworker-arm22
openqa-clone-job --skip-chained-deps --parental-inheritance --within-instance {BUILD,TEST}+=-poo150920 _GROUP=0 WORKER_CLASS=openqaworker-arm22

yep, also fine.

So let's check across both warm21+warm22:

openqa-clone-job --skip-chained-deps --parental-inheritance --within-instance {BUILD,TEST}+=-poo150920 _GROUP=0 WORKER_CLASS:ovs-server=openqaworker-arm22 WORKER_CLASS:ovs-client=openqaworker-arm21

Both were successful, so I enabled "tap" for production on openqaworker-arm22 again as well. With that we can resolve here.


