action #99345
action #107062: Multiple failures due to network issues
[tools][qem] Incomplete test runs on s390x with auto_review:"backend died: Error connecting to VNC server.*s390.*Connection timed out":retry size:M
Description
Observation
openQA test in scenario sle-12-SP4-Server-DVD-Updates-s390x-mru-install-minimal-with-addons@s390x-kvm-sle12 incomplete, stops at start_install
Test suite description
Testsuite maintained at https://gitlab.suse.de/qa-maintenance/qam-openqa-yml.
Reproducible
Fails since (at least) Build 20210927-1 (current job)
Find jobs referencing this ticket with the help of
https://raw.githubusercontent.com/os-autoinst/scripts/master/openqa-query-for-job-label, e.g.
openqa-query-for-job-label poo#99345
Expected result
Last good: 20210925-1 (or more recent)
Acceptance criteria
- AC1: The root cause of the problem is known
- AC2: The next steps are known and have been initiated
Suggestions
- Talk to all the people involved to get the full story
Further details
Always latest result in this scenario: latest
IPMI and s390x workers keep losing VNC connection during SLES installation and the reconnect attempt gets stuck for some strange reason until the job hits MAX_JOB_TIME:
[2022-05-11T13:46:24.602260+02:00] [debug] <<< testapi::wait_screen_change(timeout=10, similarity_level=50)
XIO: fatal IO error 4 (Interrupted system call) on X server ":37191" after 23426 requests (23426 known processed) with 0 events remaining.
XIO: fatal IO error 11 (Resource temporarily unavailable) on X server ":34867" after 28852 requests (28852 known processed) with 0 events remaining.
[2022-05-11T15:36:53.708309+02:00] [debug] autotest received signal TERM, saving results of current test before exiting
[2022-05-11T15:36:53.708518+02:00] [debug] isotovideo received signal TERM
[2022-05-11T15:36:53.708516+02:00] [debug] backend got TERM
Note that the job spent 110 minutes in wait_screen_change() that was supposed to time out after 10 seconds.
In another job it was stuck on assert_screen:
[2022-05-11T15:58:59.169408+02:00] [debug] <<< testapi::assert_screen(mustmatch="installation", no_wait=1, timeout=30)
[2022-05-11T16:15:16.486959+02:00] [warn] !!! consoles::VNC::catch {...} : Error in VNC protocol - relogin: short read for zrle data 659 - 950
[2022-05-11T21:47:05.966962+02:00] [debug] backend got TERM
[2022-05-11T21:47:05.966980+02:00] [debug] isotovideo received signal TERM
[2022-05-11T21:47:05.967084+02:00] [debug] autotest received signal TERM, saving results of current test before exiting
XIO: fatal IO error 11 (Resource temporarily unavailable) on X server ":44341" after 28689 requests (28689 known processed) with 0 events remaining.
XIO: fatal IO error 11 (Resource temporarily unavailable) on X server ":60785" after 39307 requests (39307 known processed) with 0 events remaining.
Here we've even got a VNC error and the VNC client would try to re-login, but I suppose that is pointless because the VNC server terminates when the connection is lost anyway. So for a real retry we would likely need to also restart the VNC server. (Note that @MDoucha tried to connect manually here, see https://progress.opensuse.org/issues/99345#note-9)
Related issues
History
#1
Updated by cdywan 6 months ago
- Is duplicate of action #76813: [tools] Test using svirt backend fails with auto_review:"Error connecting to VNC server.*: IO::Socket::INET: connect: Connection refused":retry added
#2
Updated by okurz 4 months ago
- Subject changed from [qem] Incomplete test runs on s390x with `backend died: Error connecting to VNC server <10.161.145.95:5901>: IO::Socket::INET: connect: Connection timed out` to [tools][qem] Incomplete test runs on s390x with `backend died: Error connecting to VNC server <10.161.145.95:5901>: IO::Socket::INET: connect: Connection timed out`
- Target version set to Ready
#3
Updated by okurz 4 months ago
- Is duplicate of deleted (action #76813: [tools] Test using svirt backend fails with auto_review:"Error connecting to VNC server.*: IO::Socket::INET: connect: Connection refused":retry)
#4
Updated by okurz 4 months ago
- Related to action #76813: [tools] Test using svirt backend fails with auto_review:"Error connecting to VNC server.*: IO::Socket::INET: connect: Connection refused":retry added
#6
Updated by okurz 3 months ago
- Subject changed from [tools][qem] Incomplete test runs on s390x with `backend died: Error connecting to VNC server <10.161.145.95:5901>: IO::Socket::INET: connect: Connection timed out` to [tools][qem] Incomplete test runs on s390x with auto_review:"backend died: Error connecting to VNC server.*s390.*Connection timed out":retry
- Description updated (diff)
- Priority changed from Normal to High
I called
export host=openqa.suse.de; bash -ex ./openqa-monitor-incompletes | bash -e ./openqa-label-known-issues
for #108896 and found 63 unknown issues, most of them about "Error connecting to VNC server.*s390", so I bumped the priority, adapted the subject to act as auto-review control words and added the template text using openqa-query-for-job-label. Now I will call the above openqa-label-known-issues command again.
#8
Updated by okurz about 2 months ago
- Priority changed from High to Low
We decided to work on #107026 first, so I am changing the priority of this ticket accordingly.
#9
Updated by MDoucha about 2 months ago
This issue is currently blocking auto-approval of nearly all SLE-12SP5 maintenance updates. See https://openqa.suse.de/tests/8735098#next_previous
IPMI and s390x workers keep losing VNC connection during SLES installation and the reconnect attempt gets stuck for some strange reason until the job hits MAX_JOB_TIME. The logs are not very helpful, reconnect logging could certainly be improved for a start:
https://openqa.suse.de/tests/8735098/logfile?filename=autoinst-log.txt
[2022-05-11T13:46:24.602260+02:00] [debug] <<< testapi::wait_screen_change(timeout=10, similarity_level=50)
XIO: fatal IO error 4 (Interrupted system call) on X server ":37191" after 23426 requests (23426 known processed) with 0 events remaining.
XIO: fatal IO error 11 (Resource temporarily unavailable) on X server ":34867" after 28852 requests (28852 known processed) with 0 events remaining.
[2022-05-11T15:36:53.708309+02:00] [debug] autotest received signal TERM, saving results of current test before exiting
[2022-05-11T15:36:53.708518+02:00] [debug] isotovideo received signal TERM
[2022-05-11T15:36:53.708516+02:00] [debug] backend got TERM
Note that the job spent 110 minutes in wait_screen_change() that was supposed to time out after 10 seconds.
I've checked the job groups of other SLE versions and it seems that only SLE-12SP5 is affected. I've also tried connecting via VNC to one running job that was stuck for too long and the VNC server was running and responding. It was this job, specifically: http://openqa.qam.suse.cz/tests/42502
#10
Updated by okurz about 2 months ago
- Status changed from Blocked to New
- Priority changed from Low to Urgent
#107026 was resolved, so now is a good time to look into this with urgency.
#11
Updated by mgrifalconi about 2 months ago
- Has duplicate action #110902: [qe-core]qam-minimal-full@s390x-zVM-vswitch-l3 multiple failures added
#12
Updated by cdywan about 2 months ago
- Related to action #111004: Timeout of test API functions not enforced if backend gets stuck, e.g. on the VNC socket size:M added
#13
Updated by okurz about 2 months ago
- Related to coordination #109656: [epic] Stable non-qemu backends added
#14
Updated by cdywan about 2 months ago
- Subject changed from [tools][qem] Incomplete test runs on s390x with auto_review:"backend died: Error connecting to VNC server.*s390.*Connection timed out":retry to [tools][qem] Incomplete test runs on s390x with auto_review:"backend died: Error connecting to VNC server.*s390.*Connection timed out":retry size:M
- Description updated (diff)
- Status changed from New to Workable
#15
Updated by mkittler about 2 months ago
We'll look into the wait_screen_change issue separately (see #111004). It is still not clear what the source of the initially reported issue is, which also happens under SLE-12-SP4.
However, it is definitely still happening (as well as #76813):
openqa=> select id, t_finished, test, (select value from job_settings where job_id = jobs.id and key = 'BACKEND') as backend, arch, reason from jobs where reason like '%Error connecting to VNC server%' order by id desc limit 40;
   id    |     t_finished      | test | backend | arch | reason
---------+---------------------+------+---------+------+--------
 8742546 | 2022-05-12 07:33:59 | mru-install-minimal-with-addons | svirt | s390x | backend died: Error connecting to VNC server <10.161.145.81:5901>: IO::Socket::INET: connect: Connection timed out
 8742530 | 2022-05-12 07:26:57 | offline_sles12sp5_pscc_asmm-lgm_all_full_zVM_preparation_auto_1 | s390x | s390x | backend died: Error connecting to VNC server <s390qa101.qa.suse.de:5901>: IO::Socket::INET: connect: Connection timed out
 8742413 | 2022-05-12 06:54:57 | offline_sles12sp5_pscc_asmm-lgm_all_full_zVM_preparation_auto_1 | s390x | s390x | backend died: Error connecting to VNC server <s390qa102.qa.suse.de:5901>: IO::Socket::INET: connect: Connection timed out
 8741962 | 2022-05-12 06:54:54 | offline_sles12sp5_pscc_asmm-lgm_all_full_zVM_preparation | s390x | s390x | backend died: Error connecting to VNC server <s390qa101.qa.suse.de:5901>: IO::Socket::INET: connect: Connection timed out
 8741951 | 2022-05-12 06:21:37 | offline_sles12sp5_pscc_asmm-lgm_all_full_zVM_preparation_auto_1 | s390x | s390x | backend died: Error connecting to VNC server <s390qa102.qa.suse.de:5901>: IO::Socket::INET: connect: Connection timed out
 8741950 | 2022-05-12 06:19:59 | offline_sles12sp5_pscc_asmm-lgm_all_full_zVM_ssh_pre | s390x | s390x | backend died: Error connecting to VNC server <s390qa101.qa.suse.de:5901>: IO::Socket::INET: connect: Connection timed out
 8741935 | 2022-05-12 05:47:10 | offline_sles12sp5_pscc_asmm-lgm_all_full_zVM_preparation_auto_1 | s390x | s390x | backend died: Error connecting to VNC server <s390qa102.qa.suse.de:5901>: IO::Socket::INET: connect: Connection timed out
 8738659 | 2022-05-11 23:10:32 | offline_sles12sp5_pscc_asmm-lgm_all_full_zVM_preparation | s390x | s390x | backend died: Error connecting to VNC server <s390qa102.qa.suse.de:5901>: IO::Socket::INET: connect: Connection timed out
 8738614 | 2022-05-11 21:24:07 | offline_sles12sp5_pscc_asmm-lgm_all_full_zVM_ssh_pre | s390x | s390x | backend died: Error connecting to VNC server <s390qa101.qa.suse.de:5901>: IO::Socket::INET: connect: Connection timed out
 8736240 | 2022-05-11 20:30:21 | jeos-main | svirt | x86_64 | backend died: Error connecting to VNC server <openqaw5-xen.qa.suse.de:5901>: IO::Socket::INET: connect: Connection refused
 8736238 | 2022-05-11 20:30:54 | jeos-extratest | svirt | x86_64 | backend died: Error connecting to VNC server <openqaw5-xen.qa.suse.de:5902>: IO::Socket::INET: connect: Connection refused
 8733540 | 2022-05-11 01:36:13 | installation_sles11sp4_sles15SP3_sles15sp4_zVM_ph0 | s390x | s390x | backend died: Error connecting to VNC server <s390qa102.qa.suse.de:5901>: IO::Socket::INET: connect: Connection timed out
 8733519 | 2022-05-11 01:12:48 | installation_sles11sp4_sles15SP3_sles15sp4_zVM_ph0 | s390x | s390x | backend died: Error connecting to VNC server <s390qa102.qa.suse.de:5901>: IO::Socket::INET: connect: Connection timed out
 8733486 | 2022-05-11 00:47:43 | installation_sles11sp4_sles15SP3_sles15sp4_zVM_ph0 | s390x | s390x | backend died: Error connecting to VNC server <s390qa102.qa.suse.de:5901>: IO::Socket::INET: connect: Connection timed out
 8733444 | 2022-05-11 00:23:46 | installation_sles11sp4_sles15SP3_sles15sp4_zVM_ph0 | s390x | s390x | backend died: Error connecting to VNC server <s390qa102.qa.suse.de:5901>: IO::Socket::INET: connect: Connection timed out
 8733433 | 2022-05-10 23:57:43 | installation_sles11sp4_sles15SP3_sles15sp4_zVM_ph0 | s390x | s390x | backend died: Error connecting to VNC server <s390qa102.qa.suse.de:5901>: IO::Socket::INET: connect: Connection timed out
 8733370 | 2022-05-10 23:35:48 | installation_sles11sp4_sles15SP3_sles15sp4_zVM_ph0 | s390x | s390x | backend died: Error connecting to VNC server <s390qa102.qa.suse.de:5901>: IO::Socket::INET: connect: Connection timed out
 8733348 | 2022-05-10 23:12:18 | installation_sles11sp4_sles15SP3_sles15sp4_zVM_ph0 | s390x | s390x | backend died: Error connecting to VNC server <s390qa102.qa.suse.de:5901>: IO::Socket::INET: connect: Connection timed out
 8733333 | 2022-05-10 22:48:36 | installation_sles11sp4_sles15SP3_sles15sp4_zVM_ph0 | s390x | s390x | backend died: Error connecting to VNC server <s390qa101.qa.suse.de:5901>: IO::Socket::INET: connect: Connection timed out
 8733266 | 2022-05-10 22:25:36 | installation_sles11sp4_sles15SP3_sles15sp4_zVM_ph0 | s390x | s390x | backend died: Error connecting to VNC server <s390qa102.qa.suse.de:5901>: IO::Socket::INET: connect: Connection timed out
 8733151 | 2022-05-10 22:03:57 | installation_sles11sp4_sles15SP3_sles15sp4_zVM_ph0 | s390x | s390x | backend died: Error connecting to VNC server <s390qa101.qa.suse.de:5901>: IO::Socket::INET: connect: Connection timed out
 8733143 | 2022-05-10 21:41:59 | installation_sles11sp4_sles15SP3_sles15sp4_zVM_ph0 | s390x | s390x | backend died: Error connecting to VNC server <s390qa102.qa.suse.de:5901>: IO::Socket::INET: connect: Connection timed out
 8733101 | 2022-05-10 21:20:02 | installation_sles11sp4_sles15SP3_sles15sp4_zVM_ph0 | s390x | s390x | backend died: Error connecting to VNC server <s390qa102.qa.suse.de:5901>: IO::Socket::INET: connect: Connection timed out
 8733080 | 2022-05-10 20:55:57 | installation_sles11sp4_sles15SP3_sles15sp4_zVM_ph0 | s390x | s390x | backend died: Error connecting to VNC server <s390qa102.qa.suse.de:5901>: IO::Socket::INET: connect: Connection timed out
 8732193 | 2022-05-10 20:33:46 | installation_sles11sp4_sles15SP3_sles15sp4_zVM_ph0 | s390x | s390x | backend died: Error connecting to VNC server <s390qa102.qa.suse.de:5901>: IO::Socket::INET: connect: Connection timed out
 8731378 | 2022-05-10 20:11:57 | installation_sles11sp4_sles15SP3_sles15sp4_zVM_ph0 | s390x | s390x | backend died: Error connecting to VNC server <s390qa102.qa.suse.de:5901>: IO::Socket::INET: connect: Connection timed out
 …
The overall fail ratio of s390x jobs (starting from this year) is also still quite high although the cases where it times out seem to be rather rare:
openqa=> with test_jobs as (select distinct id, result from jobs where arch = 's390x' and t_started >= '2022-01-01 00:00:0') select result, count(id) * 100. / (select count(id) from test_jobs) as ratio from test_jobs group by test_jobs.result order by ratio desc;
       result       |         ratio
--------------------+------------------------
 passed             | 55.1260314230812705
 failed             | 21.7220526732225613
 softfailed         | 14.3862326212275348
 incomplete         | 6.7395727365208545
 timeout_exceeded   | 0.79970611506725443653
 parallel_failed    | 0.64993783203345766927
 obsoleted          | 0.25714931615236803436
 user_cancelled     | 0.14129083305075166723
 user_restarted     | 0.10172939979654120041
 none               | 0.05369051655928563355
 parallel_restarted | 0.02260653328812026676
(11 rows)
The figures also haven't changed much compared to jobs from all time and compared to jobs from the last month and the last week.
The ratio of VNC issues (at least those mentioned in the ticket description where the job's reason is set accordingly) within failing/incompleting s390x jobs is indeed quite high:
openqa=> with finished as (select result, reason, t_finished from jobs where arch='s390x' and (result='failed' or result='incomplete')) select round(count(*) filter (where reason like '%Error connecting to VNC server%') * 100. / count(*), 2)::numeric(5,2)::float as ratio_of_vnc_issues, count(*) total from finished where t_finished >= '2022-01-01';
 ratio_of_vnc_issues | total
---------------------+-------
 15.38               | 10105
(1 row)
Note that "total" is the number of failing/incompleting s390x jobs and the ratio is the ratio of VNC issues within that set of job (not within all s390x jobs).
The problem has become much worse since last February:
openqa=> with finished as (select result, reason, t_finished from jobs where arch='s390x' and (result='failed' or result='incomplete')) select (extract(YEAR from t_finished)) as year, (extract(MONTH from t_finished)) as month, round(count(*) filter (where reason like '%Error connecting to VNC server%') * 100. / count(*), 2)::numeric(5,2)::float as ratio_of_vnc_issues, count(*) total from finished where t_finished >= '2021-01-01' group by year, month order by year, month asc;
 year | month | ratio_of_vnc_issues | total
------+-------+---------------------+-------
 2021 | 1     | 2.41                | 249
 2021 | 2     | 0.99                | 304
 2021 | 3     | 2.97                | 202
 2021 | 4     | 2.27                | 264
 2021 | 5     | 0.92                | 109
 2021 | 6     | 0.93                | 108
 2021 | 7     | 0                   | 55
 2021 | 8     | 0                   | 26
 2021 | 9     | 0.6                 | 168
 2021 | 10    | 0                   | 140
 2021 | 11    | 0                   | 480
 2021 | 12    | 0                   | 208
 2022 | 1     | 0.22                | 455
 2022 | 2     | 10.89               | 937
 2022 | 3     | 15.16               | 2282
 2022 | 4     | 20.66               | 4429
 2022 | 5     | 9.49                | 2002
(17 rows)
Note that the overall ratio of failed/incomplete jobs hasn't become much worse, though:
openqa=> with finished as (select result, t_finished from jobs where arch='s390x') select (extract(YEAR from t_finished)) as year, (extract(MONTH from t_finished)) as month, round(count(*) filter (where result = 'failed' or result = 'incomplete') * 100. / count(*), 2)::numeric(5,2)::float as ratio_of_all_failures_or_incompletes, count(*) total from finished where t_finished >= '2020-01-01' group by year, month order by year, month asc;
 year | month | ratio_of_all_failures_or_incompletes | total
------+-------+--------------------------------------+-------
 2020 | 1     | 49.39                                | 733
 2020 | 2     | 80                                   | 30
 2020 | 3     | 31.71                                | 820
 2020 | 4     | 26.91                                | 680
 2020 | 5     | 31.01                                | 703
 2020 | 6     | 33.33                                | 516
 2020 | 7     | 29.29                                | 140
 2020 | 8     | 30.95                                | 84
 2020 | 9     | 41.15                                | 418
 2020 | 10    | 30.73                                | 563
 2020 | 11    | 35.29                                | 544
 2020 | 12    | 30.97                                | 452
 2021 | 1     | 27.76                                | 897
 2021 | 2     | 31.9                                 | 953
 2021 | 3     | 30.75                                | 657
 2021 | 4     | 26.4                                 | 1000
 2021 | 5     | 27.95                                | 390
 2021 | 6     | 51.67                                | 209
 2021 | 7     | 35.95                                | 153
 2021 | 8     | 46.43                                | 56
 2021 | 9     | 36.76                                | 457
 2021 | 10    | 32.26                                | 434
 2021 | 11    | 33.59                                | 1429
 2021 | 12    | 38.66                                | 538
 2022 | 1     | 34.89                                | 1304
 2022 | 2     | 19.01                                | 4930
 2022 | 3     | 19.74                                | 11561
 2022 | 4     | 22.45                                | 19731
 2022 | 5     | 26.77                                | 7481
Here "total" is the number of s390x jobs in a particular month. It looks like the number of jobs has more than tripled in February compared to January. Then it doubled and then almost doubled again. So just by looking at these figures one big change since February becomes quite clear: The load in terms of number of jobs has increased a lot.
Note that since April we have also launched a significant number of investigation jobs:
openqa=> with finished as (select result, test, t_finished from jobs where arch='s390x') select (extract(YEAR from t_finished)) as year, (extract(MONTH from t_finished)) as month, round(count(*) filter (where test like '%investigate%') * 100. / count(*), 2)::numeric(5,2)::float as ratio_of_investigate_jobs, count(*) filter (where test like '%investigate%') as investigate_jobs, count(*) total from finished where t_finished >= '2021-01-01' group by year, month order by year, month asc;
 year | month | ratio_of_investigate_jobs | investigate_jobs | total
------+-------+---------------------------+------------------+-------
 2021 | 1     | 0                         | 0                | 897
 2021 | 2     | 0                         | 0                | 953
 2021 | 3     | 0                         | 0                | 657
 2021 | 4     | 0                         | 0                | 1000
 2021 | 5     | 0                         | 0                | 390
 2021 | 6     | 0                         | 0                | 209
 2021 | 7     | 0                         | 0                | 153
 2021 | 8     | 0                         | 0                | 56
 2021 | 9     | 0                         | 0                | 457
 2021 | 10    | 0                         | 0                | 434
 2021 | 11    | 0                         | 0                | 1429
 2021 | 12    | 0                         | 0                | 538
 2022 | 1     | 0                         | 0                | 1304
 2022 | 2     | 0.02                      | 1                | 4930
 2022 | 3     | 0                         | 0                | 11561
 2022 | 4     | 17.87                     | 3525             | 19731
 2022 | 5     | 26.11                     | 1956             | 7492
(17 rows)
However, that doesn't explain the additional load we've seen since February.
#16
Updated by MDoucha about 2 months ago
mkittler wrote:
Here "total" is the number of s390x jobs in a particular month. It looks like the number of jobs has more than tripled in February compared to January. Then it doubled and then almost doubled again. So just by looking at these figures one big change since February becomes quite clear: The load in terms of number of jobs has increased a lot.
Note that this isn't a strictly s390x issue. I've had similar problems on Prague IPMI OpenQA instance, again almost exclusively on SLE-12SP5:
http://openqa.qam.suse.cz/tests/42078
http://openqa.qam.suse.cz/tests/42397
http://openqa.qam.suse.cz/tests/42404
http://openqa.qam.suse.cz/tests/42502
I've omitted a few more jobs which have no useful logs due to my mistake.
The common factor is that the tests use VNC over physical network.
#17
Updated by mkittler about 2 months ago
Note that this isn't a strictly s390x issue.
I know. For the sake of simplicity (and because the initial ticket description says so) I reduced my further queries to s390x. My first query also shows a few svirt/x86_64 jobs.
The common factor is that the tests use VNC over physical network.
Yes, I'd definitely agree with that. So jobs using the backends s390x, ipmi and svirt are mostly affected (and the arch doesn't really matter).
#18
Updated by mkittler about 2 months ago
It looks like the number of jobs has more than tripled in February compared to January. Then it doubled and then almost doubled again. So just by looking at these figures one big change since February becomes quite clear: The load in terms of number of jobs has increased a lot.
That's maybe just due to the cleanup: we simply cleaned up those old jobs, so the figures from previous months are lower due to that.
#19
Updated by okurz about 2 months ago
I would like to have some kind of monitoring checks between the openQA workers where we run x3270 and the s390x mainframe on which we have s390qa101.qa.suse.de, s390qa102.qa.suse.de, s390qa103.qa.suse.de, s390qa104.qa.suse.de, s390qa105.qa.suse.de and s390qa106.qa.suse.de.
My question would be: as the z/VM LPARs(?) are only online while tests are running and an OS can respond to pings, can we actually ping the mainframe or something to check the stability of the physical connection between the x86_64 openQA worker machine and the mainframe hosting the SUT VMs?
EDIT: We have discussed this topic in a mob session on 2022-05-12. We could not find good short-term mitigations except for brute-force retrying (a fail ratio of 90% means retrying up to 20 times to be on the safe side). The challenge with brute-force retrying in the recent "timeout_exceeded" examples like https://openqa.suse.de/tests/8735826/logfile?filename=autoinst-log.txt is that right now we think we could only detect the issue based on the string "XIO: fatal IO error 4 (Interrupted system call) on X server", which likely also appears in other cases and potentially even in passed jobs, at least in a similar way. E.g. in a passed job https://openqa.suse.de/tests/8702743/logfile?filename=autoinst-log.txt we have "XIO: fatal IO error 11 (Resource temporarily unavailable)".
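As a quick plausibility check of how specific that string is, the two logs referenced above can be grepped directly (only a sketch, assuming curl is available; the job IDs are the ones linked in this comment):
# count occurrences of the XIO error string in the timeout_exceeded job and in the passed job
for id in 8735826 8702743; do
    echo "== job $id =="
    curl -s "https://openqa.suse.de/tests/$id/logfile?filename=autoinst-log.txt" | grep -c 'XIO: fatal IO error'
done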
In https://suse.slack.com/archives/C02CANHLANP/p1652363255685769?thread_ts=1652355734.319689&cid=C02CANHLANP mdoucha identified that "As for Prague IPMI jobs, the first SLE-12SP5 failure of this kind appeared on April 8th. The previous kernel build on March 7 had clean results". We continued with "Because the tests are relying on the VNC server in the SUT, did you check an older version of SLE? Like 12SP5 GM without maintenance updates?". The issue happens in the ISO installer. We don't have a reliable reproducer on an already installed system. AFAIR there is a way to disable the auto-update or point its update repo to a fake place. Since the s390x jobs show plenty of VNC failures before March 7th it's also likely that the issue got revealed by a VNC client update instead. So if the issue is reproducible even on an older SLE12SP5 GM, then the VNC client on the worker host could be the culprit.
So for further steps ahead:
- In #111004 we plan to increase unit test coverage for functions like wait_screen_change
- For this issue #99345 we should focus on mitigation within the realistic cases, e.g. s390x z/VM. Either take a specific z/VM bound openQA worker instance out of production and run test cases locally with just isotovideo (utilizing https://github.com/os-autoinst/os-autoinst#how-to-run-test-cases with the VNC client options; see the sketch after this list) or use a local openQA setup, reproduce the issue(s), improve logging in place and find the root cause
- Martin Doucha will try an IPMI job with disabled installer auto update. If that fails try to reproduce s390x z/VM jobs with SLE12SP5 GM with no updates applied and auto-updating disabled
- Setup a worker with openSUSE Leap 15.2 or 15.3 with VNC related packages as of before 2022-03 to crosscheck if VNC client updates caused a regression
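A rough sketch of such a local isotovideo run, assuming the os-autoinst package is installed on the worker that was taken out of production and that the real settings are copied from the vars.json of an affected job; all values below are placeholders only:
# hedged local reproduction attempt; replace the placeholder settings with the
# contents of the vars.json from an affected job before running this
mkdir -p ~/poo99345 && cd ~/poo99345
cat > vars.json <<'EOF'
{
  "CASEDIR": "/var/lib/openqa/share/tests/sle",
  "BACKEND": "svirt",
  "VNC_CONNECT_TIMEOUT_REMOTE": 240
}
EOF
isotovideo 2>&1 | tee autoinst-log.txt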
#20
Updated by mgriessmeier about 2 months ago
okurz wrote:
I would like to have some kind of monitoring checks between the openQA workers where we run x3270 and the s390x mainframe on which we have s390qa101.qa.suse.de, s390qa102.qa.suse.de, s390qa103.qa.suse.de, s390qa104.qa.suse.de, s390qa105.qa.suse.de and s390qa106.qa.suse.de.
Hmm I will think about possible solutions to that.
My question would be: as the z/VM LPARs(?) are only online while tests are running and an OS can respond to pings, can we actually ping the mainframe or something to check the stability of the physical connection between the x86_64 openQA worker machine and the mainframe hosting the SUT VMs?
so s390qa10[1-6] are z/VM guests aka VMs on top of z/VM running on LPAR ZP11 (s390zp11.suse.de). It is true that the guest (VM) is only online while it is running, aka connected to an x3270 terminal in some way.
So the first check that comes to my mind is whether the worker can reach s390zp11.suse.de (which hopefully should always work).
Secondly, you can try to connect to it via x3270 (or s3270 on a command line, and check the return code). I would need to check the full syntax and capabilities, which I probably won't be able to do today. The basic command to connect would be simply s3270 s390zp11.suse.de, but there are a lot more options to look into.
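A rough sketch of such a check, assuming ping and the scripted s3270 emulator are available on the worker; the s3270 action syntax and its exit behaviour would need to be verified before relying on this for monitoring:
#!/bin/bash
# hedged connectivity probe from an openQA worker towards the mainframe;
# assumes s3270 accepts actions on stdin (exit behaviour to be verified)
host=s390zp11.suse.de
if ! ping -c 3 -W 2 "$host" >/dev/null; then
    echo "FAIL: $host does not answer pings"
    exit 1
fi
# try a scripted 3270 connect followed by an immediate disconnect
if printf 'Connect(%s)\nDisconnect\nQuit\n' "$host" | s3270 >/dev/null 2>&1; then
    echo "OK: 3270 connection to $host works"
else
    echo "FAIL: 3270 connection to $host could not be established"
    exit 2
fi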
#21
Updated by MDoucha about 2 months ago
okurz wrote:
- Martin Doucha will try an IPMI job with disabled installer auto update. If that fails try to reproduce s390x z/VM jobs with SLE12SP5 GM with no updates applied and auto-updating disabled
Disabling installer auto update by booting with self_update=0 did not help: http://openqa.qam.suse.cz/tests/42595
The installer media have a last modification date of November 2019. However, some IPMI workers appear to be more prone to this issue than others. I've reproduced the VNC failure on kaylee.qam.suse.cz and jayne.qam.suse.cz so far. I haven't seen it happen on any other Prague IPMI machine (yet).
#22
Updated by mkittler about 2 months ago
#111004#note-14 and further comments show that we don't have a general network problem but that it is the VNC server which is stuck. At least this applies to the scenario/problem mentioned in #99345#note-9 which I'm currently using for my investigation as it has a really high error rate.
Of course that raises the question why the VNC server is broken.
#23
Updated by mkittler about 2 months ago
Maybe the VNC server isn't totally broken and vncviewer just cannot cope with it, see #111004#note-16.
#24
Updated by mkittler about 2 months ago
https://github.com/os-autoinst/os-autoinst/pull/2061 improves cases mentioned in #99345#note-9.
As stated before, I'm not so sure whether these are actually network issues or VNC server issues. In today's investigation for #111004 I couldn't see any clear networking issues, e.g. I could always connect to the VNC server. However, connections often seemed to get stuck afterwards. That looked more like the VNC server itself got stuck (e.g. in the thread serving that particular client/connection).
I've just been restarting https://openqa.suse.de/tests/8769221 which failed with Error connecting to VNC server <s390qa103.qa.suse.de:5901>: IO::Socket::INET: connect: timeout. The restarted job runs on openqaworker2 which has my changes for #111004. Let's see whether it also lets this test pass: https://openqa.suse.de/tests/8771492#live
#25
Updated by mkittler about 2 months ago
It passed. So I've been restarting a few others that failed similarly.
#26
Updated by mkittler about 2 months ago
4 out of 5 restarted jobs that failed with "Connection timed out" have passed. Maybe just luck or already an improvement. At least none of the jobs got stuck and ran into a timeout.
One further workaround could be to simply increase the VNC timeout via a test setting. The default timeout is 60 seconds for remote connections and one could try to increase it, e.g. VNC_CONNECT_TIMEOUT_REMOTE=240.
It may also make sense to make multiple connection attempts until the timeout is exceeded (instead of one connection attempt with a long timeout). Implementing such a change within our VNC client should be quite simple.
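As a quick way to try that workaround on an affected scenario, one of the incomplete jobs could be cloned with the setting overridden. A minimal sketch, assuming openqa-clone-job from the openQA client tooling is installed and API keys for OSD are configured; the job ID is taken from the examples above and the BUILD value is chosen arbitrarily for grouping:
# hedged example: re-run an affected job with a longer VNC connect timeout
openqa-clone-job --within-instance https://openqa.suse.de 8769221 \
    VNC_CONNECT_TIMEOUT_REMOTE=240 _GROUP=0 BUILD=poo99345-vnc-timeout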
#27
Updated by mkittler about 1 month ago
- Status changed from Workable to In Progress
PR for increasing the connect timeout: https://github.com/os-autoinst/os-autoinst/pull/2064
#28
Updated by openqa_review about 1 month ago
- Due date set to 2022-06-02
Setting due date based on mean cycle time of SUSE QE Tools
#29
Updated by okurz about 1 month ago
The PR looks fine. Do you plan to do the "multiple connecting attempts" in parallel?
#30
Updated by mkittler about 1 month ago
- Related to action #111063: Ping monitoring for our s390z mainframes size:S added
#31
Updated by mkittler about 1 month ago
In parallel? That was never planned, and I also think it would only make things needlessly complicated (e.g. if one attempt succeeds one would have to abort the other attempts). But now it does multiple connect attempts within the extended timeout in sequence. Let's see whether it helps to reduce the number of those incompletes.
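Conceptually (illustrated here in shell rather than in the actual Perl VNC client code, and using one of the affected hosts from this ticket as an example) this means retrying a short connect attempt in a loop until an overall deadline is reached:
# conceptual illustration only; the real change lives in the os-autoinst VNC client
deadline=$((SECONDS + 240))
until timeout 10 bash -c 'cat < /dev/null > /dev/tcp/s390qa105.qa.suse.de/5901'; do
    if (( SECONDS >= deadline )); then
        echo "giving up after 240 seconds"
        break
    fi
    sleep 5
done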
#32
Updated by okurz about 1 month ago
mkittler wrote:
In parallel? That was never planned.
No I meant that just you would work on something in parallel, not the code :D
#33
Updated by mkittler about 1 month ago
The extended timeout has been deployed since 20.05.22 07:23 CEST and the timeout for reads/writes on the VNC socket has been deployed since 18.05.22 07:17 CEST.
I suppose the problem of s390x jobs getting stuck and running into timeouts is resolved:
openqa=> with finished as (select result, t_finished from jobs where arch='s390x') select (extract(YEAR from t_finished)) as year, (extract(MONTH from t_finished)) as month, (extract(DAY from t_finished)) as day, round(count(*) filter (where result = 'timeout_exceeded') * 100. / count(*), 2)::numeric(5,2)::float as ratio_of_all_timeouts, count(*) total from finished where t_finished >= '2022-05-15' group by year, month, day order by year, month, day asc;
 year | month | day | ratio_of_all_timeouts | total
------+-------+-----+-----------------------+-------
 2022 | 5     | 15  | 0                     | 404
 2022 | 5     | 16  | 2.76                  | 507
 2022 | 5     | 17  | 0                     | 545
 2022 | 5     | 18  | 0.88                  | 457
 2022 | 5     | 19  | 0                     | 457
 2022 | 5     | 20  | 0                     | 187
(6 rows)
Note that the number of s390x jobs with VNC issues isn't that high overall:
openqa=> with finished as (select reason, t_finished from jobs where arch='s390x') select (extract(YEAR from t_finished)) as year, (extract(MONTH from t_finished)) as month, (extract(DAY from t_finished)) as day, round(count(*) filter (where reason like '%Error connecting to VNC%') * 100. / count(*), 2)::numeric(5,2)::float as ratio_of_all_vnc_connect_errors, count(*) total from finished where t_finished >= '2022-05-01' group by year, month, day order by year, month, day asc;
 year | month | day | ratio_of_all_vnc_connect_errors | total
------+-------+-----+---------------------------------+-------
 2022 | 5     | 1   | 0                               | 140
 2022 | 5     | 2   | 0.45                            | 444
 2022 | 5     | 3   | 0.3                             | 657
 2022 | 5     | 4   | 0.42                            | 709
 2022 | 5     | 5   | 1.63                            | 676
 2022 | 5     | 6   | 1.11                            | 1079
 2022 | 5     | 7   | 4.71                            | 552
 2022 | 5     | 8   | 0.25                            | 405
 2022 | 5     | 9   | 6.77                            | 812
 2022 | 5     | 10  | 5.2                             | 1249
 2022 | 5     | 11  | 1.8                             | 333
 2022 | 5     | 12  | 6.33                            | 837
 2022 | 5     | 13  | 4.03                            | 1067
 2022 | 5     | 14  | 1.07                            | 466
 2022 | 5     | 15  | 0.74                            | 404
 2022 | 5     | 16  | 1.97                            | 507
 2022 | 5     | 17  | 0.37                            | 545
 2022 | 5     | 18  | 0                               | 457
 2022 | 5     | 19  | 0.22                            | 457
 2022 | 5     | 20  | 0                               | 192
There was actually only one occurrence within the last three days:
openqa=> with finished as (select result, reason, t_finished from jobs where arch='s390x' and (result='incomplete')) select (extract(YEAR from t_finished)) as year, (extract(MONTH from t_finished)) as month, (extract(DAY from t_finished)) as day, round(count(*) filter (where reason like '%Error connecting to VNC server%') * 100. / count(*), 2)::numeric(5,2)::float as ratio_of_vnc_issues_within_incompletes, count(*) total_incompletes from finished where t_finished >= '2022-05-01' group by year, month, day order by year, month, day asc;
 year | month | day | ratio_of_vnc_issues_within_incompletes | total_incompletes
------+-------+-----+----------------------------------------+-------------------
 2022 | 5     | 1   | 0                                      | 1
 2022 | 5     | 2   | 50                                     | 4
 2022 | 5     | 3   | 50                                     | 4
 2022 | 5     | 4   | 60                                     | 5
 2022 | 5     | 5   | 100                                    | 9
 2022 | 5     | 6   | 84.62                                  | 13
 2022 | 5     | 7   | 78.13                                  | 32
 2022 | 5     | 8   | 8.33                                   | 12
 2022 | 5     | 9   | 82.81                                  | 64
 2022 | 5     | 10  | 100                                    | 65
 2022 | 5     | 11  | 100                                    | 6
 2022 | 5     | 12  | 65.43                                  | 81
 2022 | 5     | 13  | 87.5                                   | 48
 2022 | 5     | 14  | 71.43                                  | 7
 2022 | 5     | 15  | 10.34                                  | 29
 2022 | 5     | 16  | 80                                     | 10
 2022 | 5     | 17  | 100                                    | 2
 2022 | 5     | 18  | 0                                      | 13
 2022 | 5     | 19  | 100                                    | 1
 2022 | 5     | 20  | 0                                      | 2
Looking at all jobs, there were only 5 jobs with VNC issues in the last three days and only one of them was an actual timeout:
openqa=> select id, t_finished, test, (select value from job_settings where job_id = jobs.id and key = 'BACKEND') as backend, arch, reason from jobs where reason like '%Error connecting to VNC server%' order by t_finished desc limit 10;
   id    |     t_finished      | test | backend | arch | reason
---------+---------------------+------+---------+------+--------
 8793234 | 2022-05-20 10:04:47 | jeos-containers:investigate:last_good_tests_and_build:1d3a5e7f5d2aa21788b5aacf7d2bde6af6fcbd78+20220518-1 | svirt | x86_64 | backend died: Error connecting to VNC server <openqaw5-xen.qa.suse.de:5903>: IO::Socket::INET: connect: Connection refused
 8793855 | 2022-05-20 10:04:42 | jeos-containers | svirt | x86_64 | backend died: Error connecting to VNC server <openqaw5-xen.qa.suse.de:5901>: IO::Socket::INET: connect: Connection refused
 8787574 | 2022-05-19 08:48:28 | offline_sles12sp3_ltss_rmt_pcm-contm_def_full | svirt | s390x | backend died: Error connecting to VNC server <10.161.145.86:5901>: IO::Socket::INET: connect: timeout
 8773498 | 2022-05-18 01:28:07 | jeos-containers | svirt | x86_64 | backend died: Error connecting to VNC server <openqaw5-xen.qa.suse.de:5901>: IO::Socket::INET: connect: Connection refused
 8773500 | 2022-05-18 01:27:40 | jeos-filesystem | svirt | x86_64 | backend died: Error connecting to VNC server <openqaw5-xen.qa.suse.de:5903>: IO::Socket::INET: connect: Connection refused
 8774388 | 2022-05-17 23:10:07 | qam-minimal+base | svirt | s390x | backend died: Error connecting to VNC server <10.161.145.92:5901>: IO::Socket::INET: connect: Connection timed out
 8771988 | 2022-05-17 15:08:36 | installation_sles11sp4_sles15SP3_sles15sp4_zVM_ph0 | s390x | s390x | backend died: Error connecting to VNC server <s390qa102.qa.suse.de:5901>: IO::Socket::INET: connect: timeout
 8768593 | 2022-05-16 23:11:32 | qam-minimal+base | svirt | s390x | backend done: Error connecting to VNC server <10.161.145.96:5901>: IO::Socket::INET: connect: Connection timed out
 8768591 | 2022-05-16 23:10:48 | mru-install-minimal-with-addons | svirt | s390x | backend died: Error connecting to VNC server <10.161.145.93:5901>: IO::Socket::INET: connect: Connection timed out
 8769221 | 2022-05-16 21:08:11 | qam-minimal-full | s390x | s390x | backend done: Error connecting to VNC server <s390qa103.qa.suse.de:5901>: IO::Socket::INET: connect: timeout
(10 rows)
In https://openqa.suse.de/tests/8793234/logfile?filename=autoinst-log.txt we can see that the increased timeout of 4 minutes is effective. I doubt that extending the timeout further would be helpful (the error here is "Connection refused" so the VNC server was likely not running anymore). I suppose this is actually a different issue (and we have #76813 for that).
https://openqa.suse.de/tests/8787574/logfile?filename=autoinst-log.txt ran before the VNC timeout was increased, and the increase might have helped:
[2022-05-19T10:48:11.285168+02:00] [info] ::: backend::baseclass::die_handler: Backend process died, backend errors are reported below in the following lines: Error connecting to VNC server <10.161.145.86:5901>: IO::Socket::INET: connect: timeout
[2022-05-19T10:48:11.285945+02:00] [debug] Closing SSH serial connection with s390zp18.suse.de
I'll check again next week.
#34
Updated by okurz about 1 month ago
Please keep in mind that SLE maintenance tests run far fewer s390x jobs and SLE validation for new builds has slowed down significantly after the SLE15SP4 GMC release, so you are likely better off with explicitly triggered tests to get good statistics.
#35
Updated by mkittler about 1 month ago
Ok, I've been cloning the last 100 passing s390x jobs on OSD: https://openqa.suse.de/tests/overview?build=test-s390x
Let's see how high the error rate will be.
#36
Updated by mkittler about 1 month ago
Most of the jobs I've spawned were obsoleted:
openqa=> with test_jobs as (select distinct id, state, result from jobs where build = 'test-s390x') select state, result, count(id) * 100. / (select count(id) from test_jobs) as ratio from test_jobs group by test_jobs.state, test_jobs.result order by ratio desc;
   state   |     result      |         ratio
-----------+-----------------+------------------------
 cancelled | obsoleted       | 76.4705882352941176
 done      | passed          | 16.6666666666666667
 done      | failed          | 3.9215686274509804
 done      | parallel_failed | 1.9607843137254902
 done      | softfailed      | 0.98039215686274509804
(5 rows)
However, the jobs that ran mostly passed and the few failures were not due to VNC connection issues. In fact, none of the > 1200 s390x jobs ran into the VNC error:
openqa=> with finished as (select reason, t_finished from jobs where arch='s390x') select (extract(YEAR from t_finished)) as year, (extract(MONTH from t_finished)) as month, (extract(DAY from t_finished)) as day, round(count(*) filter (where reason like '%Error connecting to VNC%') * 100. / count(*), 2)::numeric(5,2)::float as ratio_of_all_vnc_connect_errors, count(*) total from finished where t_finished >= '2022-05-19' group by year, month, day order by year, month, day asc;
 year | month | day | ratio_of_all_vnc_connect_errors | total
------+-------+-----+---------------------------------+-------
 2022 | 5     | 19  | 0.22                            | 457
 2022 | 5     | 20  | 0                               | 570
 2022 | 5     | 21  | 0                               | 508
 2022 | 5     | 22  | 0                               | 102
 2022 | 5     | 23  | 0                               | 98
(5 rows)
So I would conclude that the issue isn't very apparent anymore.
#37
Updated by mkittler about 1 month ago
- Status changed from In Progress to Feedback
- Priority changed from Urgent to High
#38
Updated by okurz about 1 month ago
mkittler wrote:
So I would conclude that the issue isn't very apparent anymore.
We wouldn't gain a lot if the issue just became less probable, because that would also make reproducing and fixing it harder. So we should make sure that the same issue realistically cannot happen anymore. Of course connections could still time out, but if that happens it should be really clear to the test reviewers what to do. They should not feel like they need to reopen this ticket or (even worse) just manually restart jobs because they assume it's basically still the same issue.
#39
Updated by mkittler about 1 month ago
- Due date changed from 2022-06-02 to 2022-06-17
I'll be on vacation as of tomorrow. I'll check statistics again when I'm back.
#40
Updated by mkittler about 1 month ago
So far it already looks good. Not a single further job with Error connecting to VNC.
#41
Updated by mkittler 23 days ago
We had some more VNC issues but the ratio of them is quite low:
openqa=> with finished as (select reason, t_finished from jobs where arch='s390x') select (extract(YEAR from t_finished)) as year, (extract(MONTH from t_finished)) as month, (extract(DAY from t_finished)) as day, round(count(*) filter (where reason like '%Error connecting to VNC%') * 100. / count(*), 2)::numeric(5,2)::float as ratio_of_all_vnc_connect_errors, count(*) total from finished where t_finished >= '2022-05-19' group by year, month, day order by year, month, day asc;
 year | month | day | ratio_of_all_vnc_connect_errors | total
------+-------+-----+---------------------------------+-------
 2022 | 5     | 19  | 0.22                            | 446
 2022 | 5     | 20  | 0                               | 564
 2022 | 5     | 21  | 0                               | 499
 2022 | 5     | 22  | 0                               | 96
 2022 | 5     | 23  | 0                               | 372
 2022 | 5     | 24  | 0                               | 229
 2022 | 5     | 25  | 0                               | 383
 2022 | 5     | 26  | 0                               | 340
 2022 | 5     | 27  | 0                               | 339
 2022 | 5     | 28  | 0                               | 180
 2022 | 5     | 29  | 0                               | 283
 2022 | 5     | 30  | 0                               | 483
 2022 | 5     | 31  | 0                               | 437
 2022 | 6     | 1   | 0                               | 517
 2022 | 6     | 2   | 0                               | 370
 2022 | 6     | 3   | 0.13                            | 747
 2022 | 6     | 4   | 0                               | 356
 2022 | 6     | 5   | 0.24                            | 410
 2022 | 6     | 6   | 0                               | 261
 2022 | 6     | 7   | 0                               | 364
 2022 | 6     | 8   | 0                               | 447
 2022 | 6     | 9   | 0                               | 467
(22 rows)
Those are the relevant jobs:
openqa=> select id, t_finished, test, (select value from job_settings where job_id = jobs.id and key = 'BACKEND') as backend, arch, reason from jobs where reason like '%Error connecting to VNC server%' order by t_finished desc limit 10;
   id    |     t_finished      | test | backend | arch | reason
---------+---------------------+------+---------+------+--------
 8888867 | 2022-06-05 10:49:20 | sle-micro_autoyast | svirt | s390x | backend died: Error connecting to VNC server <10.161.145.83:5901>: IO::Socket::INET: connect: timeout
 8871077 | 2022-06-03 09:37:44 | qam-minimal-full | s390x | s390x | backend died: Error connecting to VNC server <s390qa105.qa.suse.de:5901>: IO::Socket::INET: connect: timeout
 8870924 | 2022-06-03 08:42:54 | jeos-containers | svirt | x86_64 | backend died: Error connecting to VNC server <openqaw5-xen.qa.suse.de:5901>: IO::Socket::INET: connect: Connection refused
 8870930 | 2022-06-03 08:42:53 | jeos-fips | svirt | x86_64 | backend died: Error connecting to VNC server <openqaw5-xen.qa.suse.de:5902>: IO::Socket::INET: connect: Connection refused
 8830494 | 2022-05-27 05:59:24 | jeos-fs_stress | svirt | x86_64 | backend died: Error connecting to VNC server <openqaw5-xen.qa.suse.de:5902>: IO::Socket::INET: connect: Connection refused
 8830463 | 2022-05-27 05:59:19 | jeos-containers | svirt | x86_64 | backend died: Error connecting to VNC server <openqaw5-xen.qa.suse.de:5903>: IO::Socket::INET: connect: Connection refused
 8793234 | 2022-05-20 10:04:47 | jeos-containers:investigate:last_good_tests_and_build:1d3a5e7f5d2aa21788b5aacf7d2bde6af6fcbd78+20220518-1 | svirt | x86_64 | backend died: Error connecting to VNC server <openqaw5-xen.qa.suse.de:5903>: IO::Socket::INET: connect: Connection refused
 8793855 | 2022-05-20 10:04:42 | jeos-containers | svirt | x86_64 | backend died: Error connecting to VNC server <openqaw5-xen.qa.suse.de:5901>: IO::Socket::INET: connect: Connection refused
 8787574 | 2022-05-19 08:48:28 | offline_sles12sp3_ltss_rmt_pcm-contm_def_full | svirt | s390x | backend died: Error connecting to VNC server <10.161.145.86:5901>: IO::Socket::INET: connect: timeout
 8773498 | 2022-05-18 01:28:07 | jeos-containers | svirt | x86_64 | backend died: Error connecting to VNC server <openqaw5-xen.qa.suse.de:5901>: IO::Socket::INET: connect: Connection refused
(10 rows)
Job 8888867 actually shows a VNC-related error during the installation (see https://openqa.suse.de/tests/8888867#step/installation/2) and that error is not present in the previous run (see https://openqa.suse.de/tests/8909978#step/installation/3) so it might actually be a product bug.
The logs of job 8871077 show that the timeout and re-connect attempts work as expected but were still not enough (so presumably the VNC server got stuck/stopped/crashed):
[2022-06-03T11:33:36.162953+02:00] [warn] !!! consoles::VNC::update_framebuffer: Error in VNC protocol - relogin: short read for zrle data 991 - 2244
[2022-06-03T11:34:38.221298+02:00] [warn] !!! consoles::VNC::login: Error connecting to VNC server <s390qa105.qa.suse.de:5901>: IO::Socket::INET: connect: timeout
[2022-06-03T11:35:39.273462+02:00] [warn] !!! consoles::VNC::login: Error connecting to VNC server <s390qa105.qa.suse.de:5901>: IO::Socket::INET: connect: timeout
[2022-06-03T11:36:40.319105+02:00] [warn] !!! consoles::VNC::login: Error connecting to VNC server <s390qa105.qa.suse.de:5901>: IO::Socket::INET: connect: timeout
[2022-06-03T11:37:41.382654+02:00] [info] ::: backend::baseclass::die_handler: Backend process died, backend errors are reported below in the following lines: Error connecting to VNC server <s390qa105.qa.suse.de:5901>: IO::Socket::INET: connect: timeout
The same applies to 8870924 and the further jobs. Not sure whether we can do much about it and why the VNC server on the other end doesn't work anymore. I'm also wondering what the difference between the timeouts and the refused connections is. Maybe in case of the timeouts the VNC server on the other end is technically still listening but stuck? Looks like s390x jobs normally run into the timeout and x86_64 jobs into connection refused. So those are likely distinct problems.
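One simple way to tell the two failure modes apart the next time it happens would be a plain TCP probe from the worker host towards the affected VNC port. A sketch using bash's /dev/tcp; the hosts and the port are taken from the jobs above and only serve as examples:
# "Connection refused" fails immediately, a timeout means the port is filtered or the
# host is gone; a successful connect with a stuck session hints at a hung VNC server
for host in s390qa105.qa.suse.de openqaw5-xen.qa.suse.de; do
    if timeout 10 bash -c "cat < /dev/null > /dev/tcp/$host/5901"; then
        echo "$host:5901 accepts TCP connections"
    else
        echo "$host:5901 not reachable (refused or timed out)"
    fi
done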
#42
Updated by mkittler 19 days ago
Only 3 new jobs have failed for the last 4 days (https://openqa.suse.de/tests/8939209, https://openqa.suse.de/tests/8928450, https://openqa.suse.de/tests/8928436). That's still a small percentage. I'm wondering whether that's already acceptable. If not I suppose I'd try to debug the VNC servers themselves because the retry on the client-side should already cover client or connection problems. I just have no idea how to debug the VNC servers. (The svirt/s390x and svirt/x86_64 cases seem to be different as there's also a slightly different error message and the VNC server is maybe provided differently.)
#43
Updated by okurz 19 days ago
mkittler wrote:
Only 3 new jobs have failed for the last 4 days (https://openqa.suse.de/tests/8939209, https://openqa.suse.de/tests/8928450, https://openqa.suse.de/tests/8928436). That's still a small percentage. I'm wondering whether that's already acceptable.
Just put yourself into the shoes of the (poor) openQA test maintainers or maybe a package maintainer that learns that their package update is not accepted due to any of the above test failures. What would you do? Who should handle such failures?
#44
Updated by mkittler 18 days ago
The only thing I can improve right now is having those jobs restarted automatically: https://github.com/os-autoinst/openQA/pull/4702
I actually thought that would already be the case, but the current regex isn't matching the remaining cases anymore.
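For reference, a broadened pattern that would also cover the "connect: timeout" and "backend done" variants seen in the remaining jobs could look like the line below. This is only a sketch and would need to be checked against existing job reasons before being used as a control string; reasons.txt stands in for e.g. the output of a psql query:
# hypothetical broadened match, verified here against a dump of job reasons
grep -E 'backend d(ied|one): Error connecting to VNC server.*connect: (Connection timed out|timeout)' reasons.txt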
#46
Updated by okurz 17 days ago
- Due date deleted (2022-06-17)
- Status changed from Feedback to Resolved
With that merged and deployed and openqa-query-for-job-label poo#99345 only showing
8871077|2022-06-03 09:37:44|done|incomplete|qam-minimal-full|backend died: Error connecting to VNC server <s390qa105.qa.suse.de:5901>: IO::Socket::INET: connect: timeout|openqaworker2
8769221|2022-05-16 21:08:11|done|failed|qam-minimal-full|backend done: Error connecting to VNC server <s390qa103.qa.suse.de:5901>: IO::Socket::INET: connect: timeout|openqaworker2
we assume that this is resolved by now even though we did not dig deeper into VNC server debugging. Anyway, for new specific problem observations please open a separate, dedicated ticket.