action #152578
Many incompletes with "Error connecting to VNC server <unreal6.qe.nue2.suse.org:...>" size:M
Status: closed
Description
Observation
There seems to be a problem connecting to unreal6.qe.nue2.suse.org for several weeks now.
https://openqa.suse.de/tests/13062217
Reason: backend died: Error connecting to VNC server <unreal6.qe.nue2.suse.org:5904>: IO::Socket::INET: connect: Connection refused
select count(id), substring(reason from 0 for 70) as reason_substr from jobs where t_finished >= '2023-11-01T00:00:00' and result = 'incomplete' group by reason_substr order by count(id) desc;
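To narrow that count down to this specific symptom, the same query can be restricted to reasons mentioning unreal6. A minimal sketch, assuming shell access to the openQA database host and a psql role that can read the jobs table; the exact invocation on OSD may differ:
# Count recent incompletes whose reason mentions unreal6, grouped by reason prefix.
# Database name and credentials are assumptions; adjust as needed.
psql openqa -c "
  select count(id), substring(reason from 0 for 70) as reason_substr
  from jobs
  where t_finished >= '2023-11-01T00:00:00'
    and result = 'incomplete'
    and reason like '%unreal6%'
  group by reason_substr
  order by count(id) desc;"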
Suggestions
- Take unreal6 out of prod by disabling the slot(s) on all relevant worker hosts?
- But that might not be the best idea because the worker slots don't seem generally broken
- e.g. https://openqa.suse.de/tests/13064403 and https://openqa.suse.de/tests/13064408 pass even though they use unreal6 (https://openqa.suse.de/tests/latest?arch=x86_64&distri=sle&flavor=JeOS-for-kvm-and-xen-Updates&machine=svirt-xen-pv&test=jeos-containers-docker&version=15-SP5, https://openqa.suse.de/tests/latest?arch=x86_64&distri=sle&flavor=JeOS-for-kvm-and-xen-Updates&machine=svirt-xen-hvm&test=jeos-containers-podman&version=15-SP5)
- Find out what's wrong on unreal6 by investigating that jump host or asking test maintainers (see the connectivity-check sketch after this list)
- Confirm if the test itself may be broken versus a general issue with the vnc backend
- Maybe this is a product issue - it's all SLE15SP6 by the looks of it?
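As a starting point for investigating the jump host, a quick connectivity check could look like the following sketch; the port comes from the error message above, while the ssh user and the exact commands are assumptions:
# Check whether the VNC port from the error message is reachable at all.
nc -zv unreal6.qe.nue2.suse.org 5904
# If it is not, look on the jump host itself: which VNC ports are listening
# and which libvirt domains are currently defined/running?
ssh root@unreal6.qe.nue2.suse.org 'ss -tlnp | grep 59; virsh list --all'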
Problem
H1 REJECTED The product has changed -> It happened for several different builds, and there were also successful tests with the same builds
- H1.1 product changed slightly but in an acceptable way without the need for communication with DEV+RM --> adapt test
- H1.2 product changed slightly but in an acceptable way found after feedback from RM --> adapt test
- H1.3 product changed significantly --> after approval by RM adapt test
H2 Fails because of changes in test setup
- H2.1 Our test hardware equipment behaves differently
- H2.2 The network behaves differently
H3 Fails because of changes in test infrastructure software, e.g. os-autoinst, openQA
H4 Fails because of changes in test management configuration, e.g. openQA database settings
H5 Fails because of changes in the test software itself (the test plan in source code as well as needles)
H6 Sporadic issue, i.e. the root problem is already hidden in the system for a long time but does not show symptoms every time
Updated by tinita 11 months ago
- Related to action #152569: Many incomplete jobs endlessly restarted over several weeks size:M added
Updated by tinita 11 months ago
- Related to action #152560: [alert] Incomplete jobs (not restarted) of last 24h alert Salt added
Updated by okurz 11 months ago · Edited
Crosschecking with "last good" build retriggered
openqa-clone-job --within-instance https://openqa.suse.de/tests/12841342 {TEST,BUILD}+=-poo152578 _GROUP=0
-> https://openqa.suse.de/tests/13112758
The clone failed with what looks like the same problem, so either there is no product regression or, despite using the old "last good" ISO, the test does not reflect the complete state of the product as of "last good".
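To keep an eye on such a retriggered clone, its state and result can be polled via the openQA API; a sketch using openqa-cli from the openQA client tooling (the job ID is the clone from above):
# Query state and result of the retriggered job via the openQA API.
openqa-cli api --host https://openqa.suse.de jobs/13112758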
Potentially related:
tinita and I monitored "virsh list" on unreal6 while a test was running and observed that the VM was not running, or no longer running, after the initial steps.
This reminds me of
https://bugzilla.suse.com/show_bug.cgi?id=1209245
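The ad-hoc monitoring mentioned above could be reproduced with a simple loop, e.g. (hostname taken from the ticket, ssh user and interval are assumptions):
# Watch the libvirt domains on unreal6 while a test is running to see
# whether the VM stops or disappears after the initial steps.
watch -n 5 "ssh root@unreal6.qe.nue2.suse.org 'virsh list --all'"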
Updated by tinita 11 months ago
I looked into how many incompletes we had over the last few days.
It turns out that since December 17 we only have investigate jobs running into that problem:
openqa=> select count(id), test, substring(reason from 0 for 70) as reason_substr from jobs where t_finished >= '2023-12-17T00:00:00' and result = 'incomplete' and reason like '%unreal6%' group by reason_substr, test order by count(id) desc;
count | test | reason_substr
-------+--------------------------------------------------------------------------------------------+-----------------------------------------------------------------------
378 | jeos-base+sdk+desktop:investigate:last_good_tests:f99df70fc4702425fc55668a06d45bc639bf5056 | backend died: Error connecting to VNC server <unreal6.qe.nue2.suse.or
206 | jeos-filesystem:investigate:retry | backend died: Error connecting to VNC server <unreal6.qe.nue2.suse.or
205 | jeos-extratest:investigate:retry | backend died: Error connecting to VNC server <unreal6.qe.nue2.suse.or
205 | jeos-kdump:investigate:retry | backend died: Error connecting to VNC server <unreal6.qe.nue2.suse.or
204 | jeos-containers-docker:investigate:retry | backend died: Error connecting to VNC server <unreal6.qe.nue2.suse.or
107 | memtest:investigate:retry | backend died: Error connecting to VNC server <unreal6.qe.nue2.suse.or
106 | memtest:investigate:last_good_tests:3a3104f2ab3bc31d94191dc20635f191ef914fe2 | backend died: Error connecting to VNC server <unreal6.qe.nue2.suse.or
105 | jeos-base+sdk+desktop:investigate:retry | backend died: Error connecting to VNC server <unreal6.qe.nue2.suse.or
105 | jeos-containers-podman:investigate:retry | backend died: Error connecting to VNC server <unreal6.qe.nue2.suse.or
104 | jeos-fs_stress:investigate:retry | backend died: Error connecting to VNC server <unreal6.qe.nue2.suse.or
102 | jeos-main:investigate:last_good_tests:62553f401b66a1ec01fa037476113a1a42016150 | backend died: Error connecting to VNC server <unreal6.qe.nue2.suse.or
100 | jeos-extratest:investigate:last_good_tests:f99df70fc4702425fc55668a06d45bc639bf5056 | backend died: Error connecting to VNC server <unreal6.qe.nue2.suse.or
100 | jeos-fips:investigate:retry | backend died: Error connecting to VNC server <unreal6.qe.nue2.suse.or
10 | memtest-poo152578 | backend died: Error connecting to VNC server <unreal6.qe.nue2.suse.or
1 | jeos-extratest:investigate:last_good_build:2.37 | backend died: Error connecting to VNC server <unreal6.qe.nue2.suse.or
(15 rows)
Real tests could reappear when the corresponding tests are scheduled again, though.
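To confirm that, the query above could be repeated with investigate jobs filtered out; a sketch under the same psql-access assumptions as before:
# Same query restricted to non-investigate jobs, to check whether any
# "real" tests still run into the unreal6 VNC problem.
psql openqa -c "
  select count(id), test
  from jobs
  where t_finished >= '2023-12-17T00:00:00'
    and result = 'incomplete'
    and reason like '%unreal6%'
    and test not like '%investigate%'
  group by test
  order by count(id) desc;"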
Updated by tinita 11 months ago
- Status changed from In Progress to Feedback
Asked in Slack: https://suse.slack.com/archives/C02CANHLANP/p1702993231776479
Updated by okurz 8 months ago
- Related to action #76813: [tools] Test using svirt backend fails with auto_review:"Error connecting to VNC server.*: IO::Socket::INET: connect: Connection refused" added
Updated by okurz 8 months ago
- Due date deleted (2024-01-12)
- Status changed from Feedback to Workable
- Assignee deleted (tinita)
- Target version changed from Tools - Next to Ready
Updating after 2 months, as tinita has confirmed that there was no response in Slack and apparently nobody else asked. We should mob on that issue as a team.
Updated by tinita 8 months ago
- Status changed from Workable to Resolved
- Assignee set to tinita
I had a look again; the last incomplete with that message is from February 19, and the latest test actually passed: https://openqa.suse.de/tests/13062217#next_previous
So I think this can be resolved now.
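For the record, a quick way to double-check that no newer incompletes with this reason exist (again assuming psql access to the openQA database):
# Show the most recent incomplete mentioning unreal6; if the timestamp stays
# at February 19, the problem has not reappeared.
psql openqa -c "
  select max(t_finished)
  from jobs
  where result = 'incomplete' and reason like '%unreal6%';"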