action #150911: remote_{vnc,ssh}_controller: unable to refresh repo download.o.o - openQA Tests (public) - openSUSE Project Management Tool

Added by dimstar over 1 year ago. Updated over 1 year ago.

Status: Resolved
Priority: Normal
Assignee: okurz
Category: Bugs in existing tests
Target version: openQA Project (public) - Ready
Start date: 2023-11-15
Tags: infra, reactive work

Description

Observation¶

The openQA test in scenario opensuse-Tumbleweed-DVD-x86_64-remote_vnc_controller@64bit fails in await_install.

The test tries endlessly to refresh the download.o.o/update/tumbleweed repo, which is constantly reported as not accessible.
The investigation jobs reported in e.g. https://openqa.opensuse.org/tests/3724384#comments show the same error on retries with old products and old tests, which suggests that we have some underlying infra issue.
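The endless refresh described above could be replaced by a bounded probe when checking the repository by hand. The following is only a sketch under assumptions: `retry_bounded` and `probe_repo` are hypothetical helper names, not part of openQA or the test code, and the `repodata/repomd.xml` path assumes the usual rpm-md repository layout.

```shell
# Sketch only: retry a command a limited number of times instead of forever.
# retry_bounded and probe_repo are hypothetical helpers, not openQA code.
retry_bounded() {
    local attempts="$1" delay="$2" i=1
    shift 2
    while [ "$i" -le "$attempts" ]; do
        if "$@"; then
            return 0
        fi
        echo "attempt $i/$attempts failed: $*" >&2
        # Do not sleep after the final attempt
        if [ "$i" -lt "$attempts" ]; then
            sleep "$delay"
        fi
        i=$((i + 1))
    done
    return 1
}

# Assumed probe: fetch the repository metadata index and discard the body.
probe_repo() {
    curl --fail --silent --show-error --output /dev/null \
        "$1/repodata/repomd.xml"
}
```

A manual check might then look like `retry_bounded 5 30 probe_repo https://download.opensuse.org/update/tumbleweed`, which returns a non-zero exit status after five failed attempts rather than looping.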

Test suite description¶

Controller performs remote installation via vnc to the target

Reproducible¶

Fails since (at least) Build 20231114 (current job)

Expected result¶

Last good: 20231110 (or more recent)

Further details¶

Always latest result in this scenario: latest

Related issues 1 (0 open — 1 closed)

Updated by okurz over 1 year ago

Tags set to reactive work, infra
Due date set to 2023-11-29
Status changed from New to Feedback
Assignee set to okurz
Priority changed from Normal to High
Target version set to Ready

https://suse.slack.com/archives/C04MDKHQE20/p1699958792983489

(Marcus Rueckert) We are migrating download.opensuse.org and related services now.

awaiting end of migration and then monitoring

Updated by dimstar over 1 year ago

okurz wrote in #note-1:

https://suse.slack.com/archives/C04MDKHQE20/p1699958792983489

(Marcus Rueckert) We are migrating download.opensuse.org and related services now.

awaiting end of migration and then monitoring

According to https://suse.slack.com/archives/C04MDKHQE20/p1699968217497509 migration was completed yesterday

Updated by okurz over 1 year ago

Yes, but we observed multiple problems with download.opensuse.org today, and https://status.opensuse.org/ reports at least a partial outage of related monitoring services, so I don't trust the stability :)

Updated by livdywan over 1 year ago

Is duplicate of action #150920: openqaworker-arm22 is unable to join download.opensuse.org in parallel tests = tap mode size:M added

Updated by okurz over 1 year ago

Due date deleted (~~2023-11-29~~)
Status changed from Feedback to Blocked

#150920

Updated by okurz over 1 year ago

reminded in the blocker #134123-34

Updated by okurz over 1 year ago

Priority changed from High to Normal

#134123 is already high, following here with lower prio

Updated by okurz over 1 year ago

Status changed from Blocked to Resolved

we are trying to debug openQA multi-machine tests on openqaworker-arm21+arm22 and struggle to come up with a manual qemu command line that makes the machine boot. I tried e.g.

/usr/bin/qemu-system-aarch64 -device virtio-gpu-pci,edid=on,xres=1024,yres=768 -m 2048 -machine virt,gic-version=host -cpu host -mem-prealloc -mem-path /dev/hugepages/ -netdev user,id=qanet0 -device virtio-net,netdev=qanet0,mac=52:54:00:19:34:56 -enable-kvm -no-shutdown -vnc :91,share=force-shared

but when connecting over VNC I only see that the guest has not initialized the display yet. Any ideas what's missing?

openqa-clone-job --parental-inheritance --within-instance https://openqa.opensuse.org/tests/3815206 PAUSE_AT=rescuesystem_validate_131 {BUILD,TEST}+=-poo150920 _GROUP=0

rescue_system@aarch64 -> https://openqa.opensuse.org/tests/3816230

and to crosscheck with warm21

openqa-clone-job --parental-inheritance --within-instance https://openqa.opensuse.org/tests/3815206 PAUSE_AT=rescuesystem_validate_131 {BUILD,TEST}+=-poo150920 _GROUP=0 WORKER_CLASS=openqaworker-arm21

opensuse-Tumbleweed-DVD-aarch64-Build20231214-rescue_system@aarch64 -> https://openqa.opensuse.org/tests/3816231
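The `{BUILD,TEST}+=-poo150920` argument in the clone commands above relies on shell brace expansion, so the shell splits it into two separate settings before openqa-clone-job starts. A small bash sketch of what the arguments look like after expansion follows; `show_clone_args` is a hypothetical helper used only for illustration.

```shell
#!/usr/bin/env bash
# Sketch only: show how bash expands the settings passed to openqa-clone-job.
# show_clone_args is a hypothetical helper; it never calls openqa-clone-job.
show_clone_args() {
    # printf repeats its format, so each resulting word gets its own line
    printf '%s\n' {BUILD,TEST}+=-poo150920 _GROUP=0
}

show_clone_args
```

The helper prints `BUILD+=-poo150920`, `TEST+=-poo150920` and `_GROUP=0` on separate lines. The `+=` form is understood here to append the suffix to the existing value of the setting, which is presumably why the cloned jobs can be told apart from production ones by their build and test name.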

Both tests start up just fine, but the developer mode actually cannot be connected to over the webUI for either of them. Crosschecked with another machine: https://openqa.opensuse.org/tests/3816232#live for openqaworker24 is just fine. Also, both tests are not related to multi-machine and don't configure the network in rescue mode, so they do not help with debugging the network within the hosts. The question is rather why the hosts do not allow reaching the developer mode. Well, in the end mkittler fixed it. Likely wrong combinations of os-autoinst-setup-multi-machine and parameters were triggered.

So let's try developer mode and the only real multi-machine scenario on aarch64

openqa-clone-job --parental-inheritance --within-instance https://openqa.opensuse.org/tests/3815268 PAUSE_AT=libzypp_config {BUILD,TEST}+=-poo150920 _GROUP=0 WORKER_CLASS=openqaworker-arm21

-> https://openqa.opensuse.org/tests/3816247

but I didn't follow up on the same day so eventually the job ran into timeout as expected.

In the end mkittler fixed the config at least on warm21 with

firewall-cmd --zone public --remove-interface=eth0
firewall-cmd --zone trusted --add-interface=eth0
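Since the suspected cause is an unnoticed failure to add eth0 to the trusted zone, a variant of the two commands above that checks the result may be useful. This is a minimal sketch, not the fix that was applied: `move_iface_to_zone` and the `FIREWALL_CMD` override are hypothetical names introduced here so the function can be exercised without touching a real firewall.

```shell
# Sketch only: move an interface into a firewalld zone and verify the result.
# move_iface_to_zone and FIREWALL_CMD are hypothetical, for illustration.
move_iface_to_zone() {
    local iface="$1" zone="$2" current
    local fw="${FIREWALL_CMD:-firewall-cmd}"

    # firewall-cmd exits non-zero if the interface is in no zone at all
    current=$($fw --get-zone-of-interface="$iface" 2>/dev/null) || current=""
    if [ "$current" = "$zone" ]; then
        echo "$iface is already in zone $zone"
        return 0
    fi
    # Leave the old zone first, as was done manually for eth0 and public
    if [ -n "$current" ]; then
        $fw --zone="$current" --remove-interface="$iface" || return 1
    fi
    $fw --zone="$zone" --add-interface="$iface" || return 1
    # Fail loudly if the interface did not end up where expected
    [ "$($fw --get-zone-of-interface="$iface")" = "$zone" ]
}
```

Called as `move_iface_to_zone eth0 trusted`, it returns non-zero when any step fails, so a calling script could stop instead of continuing past the error. Like the commands above, this changes only the runtime configuration unless `--permanent` is also used.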

I assume that nicksinger executed os-autoinst-setup-multi-machine but did not see (or ignored) error messages about the inability to add eth0 to trusted, as it was already in public according to #150920-22?

Anyway, now the developer mode works fine and, I assume, so do multi-machine tests on warm21, so I added "tap" back to warm21:/etc/openqa/workers.ini and I am monitoring jobs.

Due to the new worker class there was a sudden rise in jobs, including jobs ending up incomplete with "Reason: cache failure: Cache service queue already full (5)", which is unfortunate. I have now increased the cache size, which is at least slightly related and might help. On both warm21+warm22 we have a 6TB NVMe from which we run most mount points, including /var/lib/openqa/cache, with enough free space, so I set

# Limit size of CACHEDIRECTORY to the specified value in GiB (50 GiB by default)
CACHELIMIT = 1000
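Whether a given CACHELIMIT fits on the filesystem holding the cache can be checked roughly as sketched below. The helper names `fs_size_gib` and `cache_limit_fits` are hypothetical, and the check only compares against the total filesystem size, not against what other mount points on the same device already use.

```shell
# Sketch only: compare a CACHELIMIT value (GiB) with the filesystem size.
# fs_size_gib and cache_limit_fits are hypothetical helpers.
fs_size_gib() {
    # df -P gives POSIX output, -k uses 1 KiB blocks; column 2 is total size
    df -Pk "$1" | awk 'NR == 2 { printf "%d\n", $2 / 1024 / 1024 }'
}

cache_limit_fits() {
    local limit_gib="$1" size_gib="$2"
    [ "$limit_gib" -le "$size_gib" ]
}
```

For example, `cache_limit_fits 1000 "$(fs_size_gib /var/lib/openqa/cache)"` succeeds when the filesystem is at least 1000 GiB in total.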

Multiple multi-machine tests now passed including

https://openqa.opensuse.org/tests/3818709 remote_ssh_controller
https://openqa.opensuse.org/tests/3818743 ovs-server
https://openqa.opensuse.org/tests/3818735 security_389ds_sssd_client
https://openqa.opensuse.org/tests/3818721 rsync-client

so I assume it's ok if we keep tap. Now to warm22.

openqa-clone-job --parental-inheritance --within-instance https://openqa.opensuse.org/tests/3815268 {BUILD,TEST}+=-poo150920 _GROUP=0 WORKER_CLASS=openqaworker-arm22
openqa-clone-job --skip-chained-deps --parental-inheritance --within-instance https://openqa.opensuse.org/tests/3818743 {BUILD,TEST}+=-poo150920 _GROUP=0 WORKER_CLASS=openqaworker-arm22

microos-Tumbleweed-DVD-aarch64-Build20231214-remote_ssh_controller@aarch64 -> https://openqa.opensuse.org/tests/3818773
microos-Tumbleweed-DVD-aarch64-Build20231214-remote_ssh_target@aarch64 -> https://openqa.opensuse.org/tests/3818772
opensuse-Tumbleweed-DVD-aarch64-Build20231215-ovs-server@aarch64 -> https://openqa.opensuse.org/tests/3818777
opensuse-Tumbleweed-DVD-aarch64-Build20231215-ovs-client@aarch64 -> https://openqa.opensuse.org/tests/3818778

yep, also fine.

So let's check across both warm21+warm22:

openqa-clone-job --skip-chained-deps --parental-inheritance --within-instance https://openqa.opensuse.org/tests/3818743 {BUILD,TEST}+=-poo150920 _GROUP=0 WORKER_CLASS:ovs-server=openqaworker-arm22 WORKER_CLASS:ovs-client=openqaworker-arm21

opensuse-Tumbleweed-DVD-aarch64-Build20231215-ovs-server@aarch64 -> https://openqa.opensuse.org/tests/3818913
opensuse-Tumbleweed-DVD-aarch64-Build20231215-ovs-client@aarch64 -> https://openqa.opensuse.org/tests/3818912

Both successful, so I enabled "tap" for production on openqaworker-arm22 again as well. With that we can resolve here.
