action #150911
closed
remote_{vnc,ssh}_controller: unable to refresh repo download.o.o
Description
Observation
openQA test in scenario opensuse-Tumbleweed-DVD-x86_64-remote_vnc_controller@64bit fails in
await_install
The test tries endlessly to refresh the download.o.o/update/tumbleweed repo, which constantly reports that the repo is not accessible.
Looking at the investigation jobs reported in e.g. https://openqa.opensuse.org/tests/3724384#comments, the same error
shows up on retries with old products and old tests, indicating that we have some underlying infra issue.
Test suite description
Controller performs remote installation via VNC to the target
Reproducible
Fails since (at least) Build 20231114 (current job)
Expected result
Last good: 20231110 (or more recent)
Further details
Always latest result in this scenario: latest
Updated by okurz about 1 year ago
- Tags set to reactive work, infra
- Due date set to 2023-11-29
- Status changed from New to Feedback
- Assignee set to okurz
- Priority changed from Normal to High
- Target version set to Ready
https://suse.slack.com/archives/C04MDKHQE20/p1699958792983489
(Marcus Rueckert) We are migrating download.opensuse.org and related services now.
awaiting end of migration and then monitoring
Updated by dimstar about 1 year ago
okurz wrote in #note-1:
https://suse.slack.com/archives/C04MDKHQE20/p1699958792983489
(Marcus Rueckert) We are migrating download.opensuse.org and related services now.
awaiting end of migration and then monitoring
According to https://suse.slack.com/archives/C04MDKHQE20/p1699968217497509 migration was completed yesterday
Updated by okurz about 1 year ago
Yes, but we observed multiple problems with download.opensuse.org today. Also, https://status.opensuse.org/ reports at least a partial outage of related monitoring services, so I don't trust the stability :)
Updated by livdywan about 1 year ago
- Is duplicate of action #150920: openqaworker-arm22 is unable to join download.opensuse.org in parallel tests = tap mode size:M added
Updated by okurz about 1 year ago
- Due date deleted (2023-11-29)
- Status changed from Feedback to Blocked
Updated by okurz about 1 year ago
- Priority changed from High to Normal
#134123 is already high, following here with lower prio
Updated by okurz about 1 year ago
- Status changed from Blocked to Resolved
We are trying to debug openQA multi-machine tests on openqaworker-arm21+arm22 and struggle to come up with a manual qemu command line that makes the machine boot. I tried e.g.
/usr/bin/qemu-system-aarch64 -device virtio-gpu-pci,edid=on,xres=1024,yres=768 -m 2048 -machine virt,gic-version=host -cpu host -mem-prealloc -mem-path /dev/hugepages/ -netdev user,id=qanet0 -device virtio-net,netdev=qanet0,mac=52:54:00:19:34:56 -enable-kvm -no-shutdown -vnc :91,share=force-shared
but when connecting over VNC I only see that the guest has not initialized the display yet. Any ideas what's missing?
openqa-clone-job --parental-inheritance --within-instance https://openqa.opensuse.org/tests/3815206 PAUSE_AT=rescuesystem_validate_131 {BUILD,TEST}+=-poo150920 _GROUP=0
rescue_system@aarch64 -> https://openqa.opensuse.org/tests/3816230
and to crosscheck with warm21
openqa-clone-job --parental-inheritance --within-instance https://openqa.opensuse.org/tests/3815206 PAUSE_AT=rescuesystem_validate_131 {BUILD,TEST}+=-poo150920 _GROUP=0 WORKER_CLASS=openqaworker-arm21
- opensuse-Tumbleweed-DVD-aarch64-Build20231214-rescue_system@aarch64 -> https://openqa.opensuse.org/tests/3816231
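The `{BUILD,TEST}+=-poo150920` argument in the clone commands above relies on shell brace expansion: the shell turns it into two separate arguments before openqa-clone-job ever sees them, so that both BUILD and TEST get the `-poo150920` suffix appended. A minimal sketch to preview the expansion, assuming bash is installed (brace expansion is not part of POSIX sh, hence the explicit `bash -c`):

```shell
# Preview what the shell passes to openqa-clone-job for the brace expression.
# bash is invoked explicitly because plain POSIX sh does not expand braces.
expanded="$(bash -c 'printf "%s\n" {BUILD,TEST}+=-poo150920')"
echo "$expanded"
# prints:
# BUILD+=-poo150920
# TEST+=-poo150920
```

Running the command from a shell without brace expansion would pass the literal string `{BUILD,TEST}+=-poo150920` instead, which is worth keeping in mind when copying these commands into scripts.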
Both tests start up just fine, but developer mode cannot be reached over the webUI for either of them. Crosschecked with another machine: https://openqa.opensuse.org/tests/3816232#live on openqaworker24 is just fine. Also, neither test is related to multi-machine and neither configures the network in rescue mode, so they do not help with debugging the network within the hosts. The question remains why the hosts do not allow reaching developer mode. Well, in the end mkittler fixed it. Likely wrong combinations of os-autoinst-setup-multi-machine and parameters were triggered.
So let's try developer mode and the only real multi-machine scenario on aarch64
openqa-clone-job --parental-inheritance --within-instance https://openqa.opensuse.org/tests/3815268 PAUSE_AT=libzypp_config {BUILD,TEST}+=-poo150920 _GROUP=0 WORKER_CLASS=openqaworker-arm21
-> https://openqa.opensuse.org/tests/3816247
but I didn't follow up on the same day so eventually the job ran into timeout as expected.
In the end mkittler fixed the config at least on warm21 with
firewall-cmd --zone public --remove-interface=eth0
firewall-cmd --zone trusted --add-interface=eth0
I assume that nicksinger executed os-autoinst-setup-multi-machine but did not see (or ignored) error messages about the inability to add eth0 to trusted as it was already in public, according to #150920-22?
Anyway, developer mode now works fine and, I assume, so do multi-machine tests on warm21, so I added "tap" back to warm21:/etc/openqa/workers.ini and I am monitoring jobs.
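The zone change above could be wrapped in a small check so that an interface already bound to another zone is removed from it first, rather than the add to trusted failing unnoticed. This is only a sketch: `move_to_trusted` and the `FIREWALL_CMD` override are hypothetical names introduced here for illustration, and treating a non-zero exit of `--get-zone-of-interface` as "not bound to any zone" is an assumption.

```shell
# FIREWALL_CMD is a hypothetical override to allow dry runs with a stub;
# by default the real firewall-cmd is used.
FIREWALL_CMD="${FIREWALL_CMD:-firewall-cmd}"

# Hypothetical helper: move interface $1 into the trusted zone,
# removing it from its current zone first if it has one.
move_to_trusted() {
    iface="$1"
    # Assumption: a non-zero exit means the interface is in no zone.
    if ! current="$($FIREWALL_CMD --get-zone-of-interface="$iface" 2>/dev/null)"; then
        current=""
    fi
    if [ "$current" = "trusted" ]; then
        echo "$iface is already in zone trusted"
        return 0
    fi
    if [ -n "$current" ]; then
        # Fail loudly instead of silently continuing with a stale binding.
        $FIREWALL_CMD --zone="$current" --remove-interface="$iface" || return 1
    fi
    $FIREWALL_CMD --zone=trusted --add-interface="$iface"
}
```

With the default command, `move_to_trusted eth0` would perform the same two steps as the manual fix when eth0 is in public. As in the commands above, this changes the runtime configuration only.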
Due to the new worker class there was a sudden rise in jobs, and also jobs ending up incomplete with "Reason: cache failure: Cache service queue already full (5)", which is unfortunate. I have now increased the cache size, which is at least slightly related and might help. On both warm21+warm22 we have a 6TB NVMe from which we run most mount points, including /var/lib/openqa/cache, with enough free space, so I set
# Limit size of CACHEDIRECTORY to the specified value in GiB (50 GiB by default)
CACHELIMIT = 1000
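Before raising CACHELIMIT it can help to confirm that the value fits on the filesystem holding the cache directory. The following is a rough sketch, not part of openQA: `fs_size_gib` and `limit_fits` are hypothetical helper names, and the comparison uses the total filesystem size because the cache's existing content counts toward the limit.

```shell
# Hypothetical helper: total size in GiB of the filesystem containing path $1.
# Uses POSIX df output: line 2, column 2 is the size in 1024-byte blocks.
fs_size_gib() {
    df -Pk "$1" | awk 'NR==2 { printf "%d\n", $2 / 1048576 }'
}

# Hypothetical helper: succeed if a limit in GiB ($1) does not exceed
# the filesystem size in GiB ($2).
limit_fits() {
    limit_gib="$1"
    size_gib="$2"
    [ "$limit_gib" -le "$size_gib" ]
}

# Example usage on a worker (path taken from the text above):
#   limit_fits 1000 "$(fs_size_gib /var/lib/openqa/cache)" && echo "limit fits"
```

This only checks capacity. Other mount points sharing the same device also consume space, so some headroom below the total size is advisable.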
Multiple multi-machine tests now passed including
- https://openqa.opensuse.org/tests/3818709 remote_ssh_controller
- https://openqa.opensuse.org/tests/3818743 ovs-server
- https://openqa.opensuse.org/tests/3818735 security_389ds_sssd_client
- https://openqa.opensuse.org/tests/3818721 rsync-client
so I assume it's ok if we keep tap. Now to warm22.
openqa-clone-job --parental-inheritance --within-instance https://openqa.opensuse.org/tests/3815268 {BUILD,TEST}+=-poo150920 _GROUP=0 WORKER_CLASS=openqaworker-arm22
openqa-clone-job --skip-chained-deps --parental-inheritance --within-instance https://openqa.opensuse.org/tests/3818743 {BUILD,TEST}+=-poo150920 _GROUP=0 WORKER_CLASS=openqaworker-arm22
- microos-Tumbleweed-DVD-aarch64-Build20231214-remote_ssh_controller@aarch64 -> https://openqa.opensuse.org/tests/3818773
- microos-Tumbleweed-DVD-aarch64-Build20231214-remote_ssh_target@aarch64 -> https://openqa.opensuse.org/tests/3818772
- opensuse-Tumbleweed-DVD-aarch64-Build20231215-ovs-server@aarch64 -> https://openqa.opensuse.org/tests/3818777
- opensuse-Tumbleweed-DVD-aarch64-Build20231215-ovs-client@aarch64 -> https://openqa.opensuse.org/tests/3818778
yep, also fine.
So let's check across both warm21+warm22:
openqa-clone-job --skip-chained-deps --parental-inheritance --within-instance https://openqa.opensuse.org/tests/3818743 {BUILD,TEST}+=-poo150920 _GROUP=0 WORKER_CLASS:ovs-server=openqaworker-arm22 WORKER_CLASS:ovs-client=openqaworker-arm21
- opensuse-Tumbleweed-DVD-aarch64-Build20231215-ovs-server@aarch64 -> https://openqa.opensuse.org/tests/3818913
- opensuse-Tumbleweed-DVD-aarch64-Build20231215-ovs-client@aarch64 -> https://openqa.opensuse.org/tests/3818912
Both successful, so I enabled "tap" for production on openqaworker-arm22 again as well. With that we can resolve here.