action #99345
closedaction #107062: Multiple failures due to network issues
[tools][qem] Incomplete test runs on s390x with auto_review:"backend died: Error connecting to VNC server.*s390.*Connection timed out":retry size:M
0%
Description
Observation¶
openQA test in scenario sle-12-SP4-Server-DVD-Updates-s390x-mru-install-minimal-with-addons@s390x-kvm-sle12 incomplete, stops at
start_install
Test suite description¶
Testsuite maintained at https://gitlab.suse.de/qa-maintenance/qam-openqa-yml.
Reproducible¶
Fails since (at least) Build 20210927-1 (current job)
Find jobs referencing this ticket with the help of
https://raw.githubusercontent.com/os-autoinst/scripts/master/openqa-query-for-job-label ,
openqa-query-for-job-label poo#99345
Expected result¶
Last good: 20210925-1 (or more recent)
Acceptance criteria¶
- AC1: The root cause of the problem is known
- AC2: The next steps are known and have been initiated
Suggestions¶
- Talk to all the people involved to get the full story
Further details¶
Always latest result in this scenario: latest
IPMI and s390x workers keep losing VNC connection during SLES installation and the reconnect attempt gets stuck for some strange reason until the job hits MAX_JOB_TIME:
[2022-05-11T13:46:24.602260+02:00] [debug] <<< testapi::wait_screen_change(timeout=10, similarity_level=50)
XIO: fatal IO error 4 (Interrupted system call) on X server ":37191"
after 23426 requests (23426 known processed) with 0 events remaining.
XIO: fatal IO error 11 (Resource temporarily unavailable) on X server ":34867"
after 28852 requests (28852 known processed) with 0 events remaining.
[2022-05-11T15:36:53.708309+02:00] [debug] autotest received signal TERM, saving results of current test before exiting
[2022-05-11T15:36:53.708518+02:00] [debug] isotovideo received signal TERM
[2022-05-11T15:36:53.708516+02:00] [debug] backend got TERM
Note that the job spent 110 minutes in wait_screen_change() that was supposed to time out after 10 seconds.
In another job it was stuck on assert_screen
:
[2022-05-11T15:58:59.169408+02:00] [debug] <<< testapi::assert_screen(mustmatch="installation", no_wait=1, timeout=30)[2022-05-11T16:15:16.486959+02:00] [warn] !!! consoles::VNC::catch {...} : Error in VNC protocol - relogin: short read for zrle data 659 - 950[2022-05-11T21:47:05.966962+02:00] [debug] backend got TERM[2022-05-11T21:47:05.966980+02:00] [debug] isotovideo received signal TERM[2022-05-11T21:47:05.967084+02:00] [debug] autotest received signal TERM, saving results of current test before exiting
XIO: fatal IO error 11 (Resource temporarily unavailable) on X server ":44341"
after 28689 requests (28689 known processed) with 0 events remaining.
XIO: fatal IO error 11 (Resource temporarily unavailable) on X server ":60785"
after 39307 requests (39307 known processed) with 0 events remaining.
Here we've even got a VNC error and the VNC client would try to re-login but I suppose it is pointless because the VNC server terminates when the connection is lost anyways. So for a real retry we likely needed to also restart the VNC server. (Note that @MDoucha tried to connect manually here, see https://progress.opensuse.org/issues/99345#note-9)
Updated by livdywan over 2 years ago
- Is duplicate of action #76813: [tools] Test using svirt backend fails with auto_review:"Error connecting to VNC server.*: IO::Socket::INET: connect: Connection refused" added
Updated by okurz over 2 years ago
- Subject changed from [qem] Incomplete test runs on s390x with `backend died: Error connecting to VNC server <10.161.145.95:5901>: IO::Socket::INET: connect: Connection timed out` to [tools][qem] Incomplete test runs on s390x with `backend died: Error connecting to VNC server <10.161.145.95:5901>: IO::Socket::INET: connect: Connection timed out`
- Target version set to Ready
Updated by okurz over 2 years ago
- Is duplicate of deleted (action #76813: [tools] Test using svirt backend fails with auto_review:"Error connecting to VNC server.*: IO::Socket::INET: connect: Connection refused")
Updated by okurz over 2 years ago
- Related to action #76813: [tools] Test using svirt backend fails with auto_review:"Error connecting to VNC server.*: IO::Socket::INET: connect: Connection refused" added
Updated by okurz over 2 years ago
- Status changed from New to Blocked
- Assignee set to mkittler
Updated by okurz over 2 years ago
- Subject changed from [tools][qem] Incomplete test runs on s390x with `backend died: Error connecting to VNC server <10.161.145.95:5901>: IO::Socket::INET: connect: Connection timed out` to [tools][qem] Incomplete test runs on s390x with auto_review:"backend died: Error connecting to VNC server.*s390.*Connection timed out":retry
- Description updated (diff)
- Priority changed from Normal to High
I called export host=openqa.suse.de; bash -ex ./openqa-monitor-incompletes | bash -e ./openqa-label-known-issues
for #108896 and found 63 unknown issues, most of them about "Error connecting to VNC server.*s390" so I bump prio, adapt the subject to act as auto-review control words and added the template text using openqa-query-for-job-label. Now I will call the above openqa-label-known-issues command again.
Updated by okurz over 2 years ago
- Priority changed from High to Low
We decided to work on #107026 first which I will change in prio accordingly.
Updated by MDoucha over 2 years ago
This issue is currently blocking auto-approval of nearly all SLE-12SP5 maintenance updates. See https://openqa.suse.de/tests/8735098#next_previous
IPMI and s390x workers keep losing VNC connection during SLES installation and the reconnect attempt gets stuck for some strange reason until the job hits MAX_JOB_TIME. The logs are not very helpful, reconnect logging could certainly be improved for a start:
https://openqa.suse.de/tests/8735098/logfile?filename=autoinst-log.txt
[2022-05-11T13:46:24.602260+02:00] [debug] <<< testapi::wait_screen_change(timeout=10, similarity_level=50)
XIO: fatal IO error 4 (Interrupted system call) on X server ":37191"
after 23426 requests (23426 known processed) with 0 events remaining.
XIO: fatal IO error 11 (Resource temporarily unavailable) on X server ":34867"
after 28852 requests (28852 known processed) with 0 events remaining.
[2022-05-11T15:36:53.708309+02:00] [debug] autotest received signal TERM, saving results of current test before exiting
[2022-05-11T15:36:53.708518+02:00] [debug] isotovideo received signal TERM
[2022-05-11T15:36:53.708516+02:00] [debug] backend got TERM
Note that the job spent 110 minutes in wait_screen_change()
that was supposed to time out after 10 seconds.
I've checked the job groups of other SLE versions and it seems that only SLE-12SP5 is affected. I've also tried connecting via VNC to one running job that was stuck for too long and the VNC server was running and responding. It was this job, specifically: http://openqa.qam.suse.cz/tests/42502
Updated by okurz over 2 years ago
- Status changed from Blocked to New
- Priority changed from Low to Urgent
#107026 was resolved so good time to look into this with urgency then.
Updated by mgrifalconi over 2 years ago
- Has duplicate action #110902: [qe-core]qam-minimal-full@s390x-zVM-vswitch-l3 multiple failures added
Updated by livdywan over 2 years ago
- Related to action #111004: Timeout of test API functions not enforced if backend gets stuck, e.g. on the VNC socket size:M added
Updated by okurz over 2 years ago
- Related to coordination #109656: [epic] Stable non-qemu backends added
Updated by livdywan over 2 years ago
- Subject changed from [tools][qem] Incomplete test runs on s390x with auto_review:"backend died: Error connecting to VNC server.*s390.*Connection timed out":retry to [tools][qem] Incomplete test runs on s390x with auto_review:"backend died: Error connecting to VNC server.*s390.*Connection timed out":retry size:M
- Description updated (diff)
- Status changed from New to Workable
Updated by mkittler over 2 years ago
We'll look into the wait_screen_change
issue separately (see #111004). It is still not clear what's the source of the initially reported issue which also happens under SLE-12-SP4.
However, it it definitely still happening (as well as #76813):
openqa=> select id, t_finished, test, (select value from job_settings where job_id = jobs.id and key = 'BACKEND') as backend, arch, reason from jobs where reason like '%Error connecting to VNC server%' order by id desc limit 40;
id | t_finished | test | backend | arch | reason
---------+---------------------+-----------------------------------------------------------------+---------+--------+----------------------------------------------------------------------------------------------------------------------------
8742546 | 2022-05-12 07:33:59 | mru-install-minimal-with-addons | svirt | s390x | backend died: Error connecting to VNC server <10.161.145.81:5901>: IO::Socket::INET: connect: Connection timed out
8742530 | 2022-05-12 07:26:57 | offline_sles12sp5_pscc_asmm-lgm_all_full_zVM_preparation_auto_1 | s390x | s390x | backend died: Error connecting to VNC server <s390qa101.qa.suse.de:5901>: IO::Socket::INET: connect: Connection timed out
8742413 | 2022-05-12 06:54:57 | offline_sles12sp5_pscc_asmm-lgm_all_full_zVM_preparation_auto_1 | s390x | s390x | backend died: Error connecting to VNC server <s390qa102.qa.suse.de:5901>: IO::Socket::INET: connect: Connection timed out
8741962 | 2022-05-12 06:54:54 | offline_sles12sp5_pscc_asmm-lgm_all_full_zVM_preparation | s390x | s390x | backend died: Error connecting to VNC server <s390qa101.qa.suse.de:5901>: IO::Socket::INET: connect: Connection timed out
8741951 | 2022-05-12 06:21:37 | offline_sles12sp5_pscc_asmm-lgm_all_full_zVM_preparation_auto_1 | s390x | s390x | backend died: Error connecting to VNC server <s390qa102.qa.suse.de:5901>: IO::Socket::INET: connect: Connection timed out
8741950 | 2022-05-12 06:19:59 | offline_sles12sp5_pscc_asmm-lgm_all_full_zVM_ssh_pre | s390x | s390x | backend died: Error connecting to VNC server <s390qa101.qa.suse.de:5901>: IO::Socket::INET: connect: Connection timed out
8741935 | 2022-05-12 05:47:10 | offline_sles12sp5_pscc_asmm-lgm_all_full_zVM_preparation_auto_1 | s390x | s390x | backend died: Error connecting to VNC server <s390qa102.qa.suse.de:5901>: IO::Socket::INET: connect: Connection timed out
8738659 | 2022-05-11 23:10:32 | offline_sles12sp5_pscc_asmm-lgm_all_full_zVM_preparation | s390x | s390x | backend died: Error connecting to VNC server <s390qa102.qa.suse.de:5901>: IO::Socket::INET: connect: Connection timed out
8738614 | 2022-05-11 21:24:07 | offline_sles12sp5_pscc_asmm-lgm_all_full_zVM_ssh_pre | s390x | s390x | backend died: Error connecting to VNC server <s390qa101.qa.suse.de:5901>: IO::Socket::INET: connect: Connection timed out
8736240 | 2022-05-11 20:30:21 | jeos-main | svirt | x86_64 | backend died: Error connecting to VNC server <openqaw5-xen.qa.suse.de:5901>: IO::Socket::INET: connect: Connection refused
8736238 | 2022-05-11 20:30:54 | jeos-extratest | svirt | x86_64 | backend died: Error connecting to VNC server <openqaw5-xen.qa.suse.de:5902>: IO::Socket::INET: connect: Connection refused
8733540 | 2022-05-11 01:36:13 | installation_sles11sp4_sles15SP3_sles15sp4_zVM_ph0 | s390x | s390x | backend died: Error connecting to VNC server <s390qa102.qa.suse.de:5901>: IO::Socket::INET: connect: Connection timed out
8733519 | 2022-05-11 01:12:48 | installation_sles11sp4_sles15SP3_sles15sp4_zVM_ph0 | s390x | s390x | backend died: Error connecting to VNC server <s390qa102.qa.suse.de:5901>: IO::Socket::INET: connect: Connection timed out
8733486 | 2022-05-11 00:47:43 | installation_sles11sp4_sles15SP3_sles15sp4_zVM_ph0 | s390x | s390x | backend died: Error connecting to VNC server <s390qa102.qa.suse.de:5901>: IO::Socket::INET: connect: Connection timed out
8733444 | 2022-05-11 00:23:46 | installation_sles11sp4_sles15SP3_sles15sp4_zVM_ph0 | s390x | s390x | backend died: Error connecting to VNC server <s390qa102.qa.suse.de:5901>: IO::Socket::INET: connect: Connection timed out
8733433 | 2022-05-10 23:57:43 | installation_sles11sp4_sles15SP3_sles15sp4_zVM_ph0 | s390x | s390x | backend died: Error connecting to VNC server <s390qa102.qa.suse.de:5901>: IO::Socket::INET: connect: Connection timed out
8733370 | 2022-05-10 23:35:48 | installation_sles11sp4_sles15SP3_sles15sp4_zVM_ph0 | s390x | s390x | backend died: Error connecting to VNC server <s390qa102.qa.suse.de:5901>: IO::Socket::INET: connect: Connection timed out
8733348 | 2022-05-10 23:12:18 | installation_sles11sp4_sles15SP3_sles15sp4_zVM_ph0 | s390x | s390x | backend died: Error connecting to VNC server <s390qa102.qa.suse.de:5901>: IO::Socket::INET: connect: Connection timed out
8733333 | 2022-05-10 22:48:36 | installation_sles11sp4_sles15SP3_sles15sp4_zVM_ph0 | s390x | s390x | backend died: Error connecting to VNC server <s390qa101.qa.suse.de:5901>: IO::Socket::INET: connect: Connection timed out
8733266 | 2022-05-10 22:25:36 | installation_sles11sp4_sles15SP3_sles15sp4_zVM_ph0 | s390x | s390x | backend died: Error connecting to VNC server <s390qa102.qa.suse.de:5901>: IO::Socket::INET: connect: Connection timed out
8733151 | 2022-05-10 22:03:57 | installation_sles11sp4_sles15SP3_sles15sp4_zVM_ph0 | s390x | s390x | backend died: Error connecting to VNC server <s390qa101.qa.suse.de:5901>: IO::Socket::INET: connect: Connection timed out
8733143 | 2022-05-10 21:41:59 | installation_sles11sp4_sles15SP3_sles15sp4_zVM_ph0 | s390x | s390x | backend died: Error connecting to VNC server <s390qa102.qa.suse.de:5901>: IO::Socket::INET: connect: Connection timed out
8733101 | 2022-05-10 21:20:02 | installation_sles11sp4_sles15SP3_sles15sp4_zVM_ph0 | s390x | s390x | backend died: Error connecting to VNC server <s390qa102.qa.suse.de:5901>: IO::Socket::INET: connect: Connection timed out
8733080 | 2022-05-10 20:55:57 | installation_sles11sp4_sles15SP3_sles15sp4_zVM_ph0 | s390x | s390x | backend died: Error connecting to VNC server <s390qa102.qa.suse.de:5901>: IO::Socket::INET: connect: Connection timed out
8732193 | 2022-05-10 20:33:46 | installation_sles11sp4_sles15SP3_sles15sp4_zVM_ph0 | s390x | s390x | backend died: Error connecting to VNC server <s390qa102.qa.suse.de:5901>: IO::Socket::INET: connect: Connection timed out
8731378 | 2022-05-10 20:11:57 | installation_sles11sp4_sles15SP3_sles15sp4_zVM_ph0 | s390x | s390x | backend died: Error connecting to VNC server <s390qa102.qa.suse.de:5901>: IO::Socket::INET: connect: Connection timed out
…
The overall fail ratio of s390x jobs (starting from this year) is also still quite high although the cases where it times out seem to be rather rare:
openqa=> with test_jobs as (select distinct id, result from jobs where arch = 's390x' and t_started >= '2022-01-01 00:00:0') select result, count(id) * 100. / (select count(id) from test_jobs) as ratio from test_jobs group by test_jobs.result order by ratio desc;
result | ratio
--------------------+------------------------
passed | 55.1260314230812705
failed | 21.7220526732225613
softfailed | 14.3862326212275348
incomplete | 6.7395727365208545
timeout_exceeded | 0.79970611506725443653
parallel_failed | 0.64993783203345766927
obsoleted | 0.25714931615236803436
user_cancelled | 0.14129083305075166723
user_restarted | 0.10172939979654120041
none | 0.05369051655928563355
parallel_restarted | 0.02260653328812026676
(11 Zeilen)
The figures also haven't changed much compared to jobs from all time and compared to jobs from the last month and the last week.
The ratio of VNC issues (at least those mentioned in the ticket description where the job's reason is set accordingly) within failing/incompleting s390x jobs is indeed quite high:
openqa=> with finished as (select result, reason, t_finished from jobs where arch='s390x' and (result='failed' or result='incomplete')) select round(count(*) filter (where reason like '%Error connecting to VNC server%') * 100. / count(*), 2)::numeric(5,2)::float as ratio_of_vnc_issues, count(*) total from finished where t_finished >= '2022-01-01';
ratio_of_vnc_issues | total
---------------------+-------
15.38 | 10105
(1 Zeile)
Note that "total" is the number of failing/incompleting s390x jobs and the ratio is the ratio of VNC issues within that set of job (not within all s390x jobs).
The problem has become much worse since last February:
openqa=> with finished as (select result, reason, t_finished from jobs where arch='s390x' and (result='failed' or result='incomplete')) select (extract(YEAR from t_finished)) as year, (extract(MONTH from t_finished)) as month, round(count(*) filter (where reason like '%Error connecting to VNC server%') * 100. / count(*), 2)::numeric(5,2)::float as ratio_of_vnc_issues, count(*) total from finished where t_finished >= '2021-01-01' group by year, month order by year, month asc;
year | month | ratio_of_vnc_issues | total
------+-------+---------------------+-------
2021 | 1 | 2.41 | 249
2021 | 2 | 0.99 | 304
2021 | 3 | 2.97 | 202
2021 | 4 | 2.27 | 264
2021 | 5 | 0.92 | 109
2021 | 6 | 0.93 | 108
2021 | 7 | 0 | 55
2021 | 8 | 0 | 26
2021 | 9 | 0.6 | 168
2021 | 10 | 0 | 140
2021 | 11 | 0 | 480
2021 | 12 | 0 | 208
2022 | 1 | 0.22 | 455
2022 | 2 | 10.89 | 937
2022 | 3 | 15.16 | 2282
2022 | 4 | 20.66 | 4429
2022 | 5 | 9.49 | 2002
(17 Zeilen)
Note that the overall ratio of failed/incomplete jobs hasn't become much worse, though:
openqa=> with finished as (select result, t_finished from jobs where arch='s390x') select (extract(YEAR from t_finished)) as year, (extract(MONTH from t_finished)) as month, round(count(*) filter (where result = 'failed' or result = 'incomplete') * 100. / count(*), 2)::numeric(5,2)::float as ratio_of_all_failures_or_incompletes, count(*) total from finished where t_finished >= '2020-01-01' group by year, month order by year, month asc;
year | month | ratio_of_all_failures_or_incompletes | total
------+-------+--------------------------------------+-------
2020 | 1 | 49.39 | 733
2020 | 2 | 80 | 30
2020 | 3 | 31.71 | 820
2020 | 4 | 26.91 | 680
2020 | 5 | 31.01 | 703
2020 | 6 | 33.33 | 516
2020 | 7 | 29.29 | 140
2020 | 8 | 30.95 | 84
2020 | 9 | 41.15 | 418
2020 | 10 | 30.73 | 563
2020 | 11 | 35.29 | 544
2020 | 12 | 30.97 | 452
2021 | 1 | 27.76 | 897
2021 | 2 | 31.9 | 953
2021 | 3 | 30.75 | 657
2021 | 4 | 26.4 | 1000
2021 | 5 | 27.95 | 390
2021 | 6 | 51.67 | 209
2021 | 7 | 35.95 | 153
2021 | 8 | 46.43 | 56
2021 | 9 | 36.76 | 457
2021 | 10 | 32.26 | 434
2021 | 11 | 33.59 | 1429
2021 | 12 | 38.66 | 538
2022 | 1 | 34.89 | 1304
2022 | 2 | 19.01 | 4930
2022 | 3 | 19.74 | 11561
2022 | 4 | 22.45 | 19731
2022 | 5 | 26.77 | 7481
Here "total" is the number of s390x jobs in a particular month. It looks like the number of jobs has more than tripled in February compared to January. Then it doubled and then almost doubled again. So just by looking at these figures one big change since February becomes quite clear: The load in terms of number of jobs has increased a lot.
Note that since April we have launched also a significant number of investigation jobs:
openqa=> with finished as (select result, test, t_finished from jobs where arch='s390x') select (extract(YEAR from t_finished)) as year, (extract(MONTH from t_finished)) as month, round(count(*) filter (where test like '%investigate%') * 100. / count(*), 2)::numeric(5,2)::float as ratio_of_investigate_jobs, count(*) filter (where test like '%investigate%') as investigate_jobs, count(*) total from finished where t_finished >= '2021-01-01' group by year, month order by year, month asc;
year | month | ratio_of_investigate_jobs | investigate_jobs | total
------+-------+---------------------------+------------------+-------
2021 | 1 | 0 | 0 | 897
2021 | 2 | 0 | 0 | 953
2021 | 3 | 0 | 0 | 657
2021 | 4 | 0 | 0 | 1000
2021 | 5 | 0 | 0 | 390
2021 | 6 | 0 | 0 | 209
2021 | 7 | 0 | 0 | 153
2021 | 8 | 0 | 0 | 56
2021 | 9 | 0 | 0 | 457
2021 | 10 | 0 | 0 | 434
2021 | 11 | 0 | 0 | 1429
2021 | 12 | 0 | 0 | 538
2022 | 1 | 0 | 0 | 1304
2022 | 2 | 0.02 | 1 | 4930
2022 | 3 | 0 | 0 | 11561
2022 | 4 | 17.87 | 3525 | 19731
2022 | 5 | 26.11 | 1956 | 7492
(17 Zeilen)
However, that doesn't explain the additional load (we've seen since February).
Updated by MDoucha over 2 years ago
mkittler wrote:
Here "total" is the number of s390x jobs in a particular month. It looks like the number of jobs has more than tripled in February compared to January. Then it doubled and then almost doubled again. So just by looking at these figures one big change since February becomes quite clear: The load in terms of number of jobs has increased a lot.
Note that this isn't a strictly s390x issue. I've had similar problems on Prague IPMI OpenQA instance, again almost exclusively on SLE-12SP5:
http://openqa.qam.suse.cz/tests/42078
http://openqa.qam.suse.cz/tests/42397
http://openqa.qam.suse.cz/tests/42404
http://openqa.qam.suse.cz/tests/42502
I've omitted a few more jobs which have no useful logs due to my mistake.
The common factor is that the tests use VNC over physical network.
Updated by mkittler over 2 years ago
Note that this isn't a strictly s390x issue.
I know. For the sake of simplicity (and because the initial ticket description says so) I reduced my further queries to s390x. My first query also shows a few svirt/x86_64 jobs.
The common factor is that the tests use VNC over physical network.
Yes, I'd definitely agree with that. So jobs using the backends s390x, ipmi and svirt are mostly affected (and the arch doesn't really matter).
Updated by mkittler over 2 years ago
It looks like the number of jobs has more than tripled in February compared to January. Then it doubled and then almost doubled again. So just by looking at these figures one big change since February becomes quite clear: The load in terms of number of jobs has increased a lot.
That's maybe just due to the cleanup. We simply cleaned up these old jobs show the figures from previous months are just lower due to that.
Updated by okurz over 2 years ago
I would like to have some kind of monitoring checks between openQA workers where we run x3270 and the s390x mainframe on where we have s390qa101.qa.suse.de s390qa102.qa.suse.de s390qa103.qa.suse.de s390qa104.qa.suse.de s390qa105.qa.suse.de s390qa106.qa.suse.de
my question would be: As the z/VM LPARs(?) are only online while tests are running and an OS can respond to pings, can we actually ping the mainframe or something to check the stability of the physical connection between the x86_64 openQA worker machine and the mainframe having the SUT VMs?
EDIT: We have discussed this topic in a mob session on 2022-05-12. We could not find good short-term mitigations expect for brute-force retrying (fail ratio of 90% means retry up to 20 times to be on the save side). The challenge with brute-force retrying in the recent "timeout_exceeded" examples like https://openqa.suse.de/tests/8735826/logfile?filename=autoinst-log.txt is that right now we think we could only detect the issue based on the string "XIO: fatal IO error 4 (Interrupted system call) on X server" which likely also appears in other cases and potentially even in passed jobs, at least in a similar way. E.g. in a passed job https://openqa.suse.de/tests/8702743/logfile?filename=autoinst-log.txt we have "XIO: fatal IO error 11 (Resource temporarily unavailable)".
In
https://suse.slack.com/archives/C02CANHLANP/p1652363255685769?thread_ts=1652355734.319689&cid=C02CANHLANP
mdoucha identified that "As for Prague IPMI jobs, the first SLE-12SP5 failure of this kind appeared on April 8th. The previous kernel build on March 7 had clean results". We continued with "Because the tests are relying on the VNC server in the SUT, did you check an older version of SLE? Like 12SP5 GM without maintenance updates?". The issue happens in the ISO installer. We don't have a reliable reproducer on an already installed system. AFAIR there is a way to disable the auto-update or point its update repo to a fake place. Since the s390x jobs show plenty of VNC failures before March 7th it's also likely that the issue got revealed by a VNC client update instead. So even on an older SLE12SP5 GM the issue is reproducible then the VNC client on the worker host could be the culprit.
So for further steps ahead:
- In #111004 we plan to increase unit test coverage for functions like
wait_screen_change
- For this issue #99345 we should focus on mitigation within the realistic cases, e.g. s390x z/VM. Either take a specific z/VM bound openQA worker instance out of production and run test cases locally with just isotovideo (and utilizing https://github.com/os-autoinst/os-autoinst#how-to-run-test-cases with the VNC client options) or a local openQA setup, reproduce the issue(s) and improve logging in place and find the root cause
- Martin Doucha will try an IPMI job with disabled installer auto update. If that fails try to reproduce s390x z/VM jobs with SLE12SP5 GM with no updates applied and auto-updating disabled
- Setup a worker with openSUSE Leap 15.2 or 15.3 with VNC related packages as of before 2022-03 to crosscheck if VNC client updates caused a regression
Updated by mgriessmeier over 2 years ago
okurz wrote:
I would like to have some kind of monitoring checks between openQA workers where we run x3270 and the s390x mainframe on where we have s390qa101.qa.suse.de s390qa102.qa.suse.de s390qa103.qa.suse.de s390qa104.qa.suse.de s390qa105.qa.suse.de s390qa106.qa.suse.de
Hmm I will think about possible solutions to that.
my question would be: As the z/VM LPARs(?) are only online while tests are running and an OS can respond to pings, can we actually ping the mainframe or something to check the stability of the physical connection between the x86_64 openQA worker machine and the mainframe having the SUT VMs?
so s390qa10[1-6] are z/VM guests aka VMs on top of z/VM running on LPAR ZP11 (s390zp11.suse.de). It is true, that the guest (VM) is only online while it is running, aka connected to a x3270 terminal in some way.
so the first checks that comes to my mind is to check if the worker can reach s390zp11.suse.de (which hopefully should always work)
secondly, you can try to connect to it via x3270 (or s3270 on a command line, and check the return code). I would need to check the full syntax and capabilities, which I probably won't be able today. The basic command to connect would be simply s3270 s390zp11.suse.de
, but there are a lot more options to look into.
Updated by MDoucha over 2 years ago
okurz wrote:
- Martin Doucha will try an IPMI job with disabled installer auto update. If that fails try to reproduce s390x z/VM jobs with SLE12SP5 GM with no updates applied and auto-updating disabled
Disabling installer auto update by booting with self_update=0
did not help:
http://openqa.qam.suse.cz/tests/42595
The installer media have last modification date in November 2019. However, some IPMI workers appear to be more prone to this issue than others. I've reproduced the VNC failure on kaylee.qam.suse.cz
and jayne.qam.suse.cz
so far. I haven't seen it happen on any other Prague IPMI machine (yet).
Updated by mkittler over 2 years ago
#111004#note-14 and further comments show that we don't have a general network problem but that it is the VNC server which is stuck. At least this applies to the scenario/problem mentioned in #99345#note-9 which I'm currently using for my investigation as it has a really high error rate.
Of course that raises the question why the VNC server is broken.
Updated by mkittler over 2 years ago
Maybe the VNC server isn't totally broken and vncviewer just cannot cope with it, see #111004#note-16.
Updated by mkittler over 2 years ago
https://github.com/os-autoinst/os-autoinst/pull/2061 improves cases mentioned in #99345#note-9.
As stated before, I'm not so sure whether these are actually network issues or VNC server issues. In my today's investigation for #111004 I couldn't see any clear networking issues, e.g. I could always connect to the VNC server. However, it seems like a connection could get stuck often afterwards. That looked more like the VNC server itself got stuck (e.g. in the thread serving that particular client/connection).
I've just been restarting https://openqa.suse.de/tests/8769221 which failed with Error connecting to VNC server <s390qa103.qa.suse.de:5901>: IO::Socket::INET: connect: timeout
. The restarted job runs on openqaworker2 which has my changes for #111004. Let's see whether it also let's this test pass: https://openqa.suse.de/tests/8771492#live
Updated by mkittler over 2 years ago
It passed. So I've been restarting a few other others that failed similarly:
Updated by mkittler over 2 years ago
4 out of 5 restarted jobs that failed with "Connection timed out" have passed. Maybe just luck or already an improvement. At least none of the jobs got stuck and ran into a timeout.
One further workaround could be to simply increase the VNC timeout via a test setting. The default timeout is 60 seconds for remote connections and one could try to increase it, e.g. VNC_CONNECT_TIMEOUT_REMOTE=240
.
Maybe it makes also sense to make multiple connection attempts until the timeout exceeds (instead of one connection attempt with a long timeout). Implementing such a change within our VNC client should be quite simple.
Updated by mkittler over 2 years ago
- Status changed from Workable to In Progress
PR for increasing the connect timeout: https://github.com/os-autoinst/os-autoinst/pull/2064
Updated by openqa_review over 2 years ago
- Due date set to 2022-06-02
Setting due date based on mean cycle time of SUSE QE Tools
Updated by okurz over 2 years ago
The PR looks fine. Do you plan to do the "multiple connecting attempts" in parallel?
Updated by mkittler over 2 years ago
- Related to action #111063: Ping monitoring for our s390z mainframes size:S added
Updated by mkittler over 2 years ago
In parallel? That was never planned. And I also think that makes things only needlessly complicated (e.g. if one attempt succeeds one had to abort other attempts). But now it does multiple connect attempts within the extended timeout in sequence. Let's see whether it helps to reduce the number of those incompletes.
Updated by okurz over 2 years ago
mkittler wrote:
In parallel? That was never planned.
No I meant that just you would work on something in parallel, not the code :D
Updated by mkittler over 2 years ago
The extended timeout has been deployed since 20.05.22 07:23 CEST and the timeout for reads/writes on the VNC socket has been deployed since 18.05.22 07:17 CEST.
I suppose the problem of s390x jobs getting stuck and running into timeouts is resolved:
openqa=> with finished as (select result, t_finished from jobs where arch='s390x') select (extract(YEAR from t_finished)) as year, (extract(MONTH from t_finished)) as month, (extract(DAY from t_finished)) as day, round(count(*) filter (where result = 'timeout_exceeded') * 100. / count(*), 2)::numeric(5,2)::float as ratio_of_all_timeouts, count(*) total from finished where t_finished >= '2022-05-15' group by year, month, day order by year, month, day asc;
year | month | day | ratio_of_all_timeouts | total
------+-------+-----+-----------------------+-------
2022 | 5 | 15 | 0 | 404
2022 | 5 | 16 | 2.76 | 507
2022 | 5 | 17 | 0 | 545
2022 | 5 | 18 | 0.88 | 457
2022 | 5 | 19 | 0 | 457
2022 | 5 | 20 | 0 | 187
(6 Zeilen)
Note that the number of s390x jobs with VNC issues isn't that high overall:
openqa=> with finished as (select reason, t_finished from jobs where arch='s390x') select (extract(YEAR from t_finished)) as year, (extract(MONTH from t_finished)) as month, (extract(DAY from t_finished)) as day, round(count(*) filter (where reason like '%Error connecting to VNC%') * 100. / count(*), 2)::numeric(5,2)::float as ratio_of_all_vnc_connect_errors, count(*) total from finished where t_finished >= '2022-05-01' group by year, month, day order by year, month, day asc;
year | month | day | ratio_of_all_vnc_connect_errors | total
------+-------+-----+---------------------------------+-------
2022 | 5 | 1 | 0 | 140
2022 | 5 | 2 | 0.45 | 444
2022 | 5 | 3 | 0.3 | 657
2022 | 5 | 4 | 0.42 | 709
2022 | 5 | 5 | 1.63 | 676
2022 | 5 | 6 | 1.11 | 1079
2022 | 5 | 7 | 4.71 | 552
2022 | 5 | 8 | 0.25 | 405
2022 | 5 | 9 | 6.77 | 812
2022 | 5 | 10 | 5.2 | 1249
2022 | 5 | 11 | 1.8 | 333
2022 | 5 | 12 | 6.33 | 837
2022 | 5 | 13 | 4.03 | 1067
2022 | 5 | 14 | 1.07 | 466
2022 | 5 | 15 | 0.74 | 404
2022 | 5 | 16 | 1.97 | 507
2022 | 5 | 17 | 0.37 | 545
2022 | 5 | 18 | 0 | 457
2022 | 5 | 19 | 0.22 | 457
2022 | 5 | 20 | 0 | 192
There was actually only one occurrence within the last three days anymore:
openqa=> with finished as (select result, reason, t_finished from jobs where arch='s390x' and (result='incomplete')) select (extract(YEAR from t_finished)) as year, (extract(MONTH from t_finished)) as month, (extract(DAY from t_finished)) as day, round(count(*) filter (where reason like '%Error connecting to VNC server%') * 100. / count(*), 2)::numeric(5,2)::float as ratio_of_vnc_issues_within_incompletes, count(*) total_incompletes from finished where t_finished >= '2022-05-01' group by year, month, day order by year, month, day asc;
year | month | day | ratio_of_vnc_issues_within_incompletes | total_incompletes
------+-------+-----+----------------------------------------+-------------------
2022 | 5 | 1 | 0 | 1
2022 | 5 | 2 | 50 | 4
2022 | 5 | 3 | 50 | 4
2022 | 5 | 4 | 60 | 5
2022 | 5 | 5 | 100 | 9
2022 | 5 | 6 | 84.62 | 13
2022 | 5 | 7 | 78.13 | 32
2022 | 5 | 8 | 8.33 | 12
2022 | 5 | 9 | 82.81 | 64
2022 | 5 | 10 | 100 | 65
2022 | 5 | 11 | 100 | 6
2022 | 5 | 12 | 65.43 | 81
2022 | 5 | 13 | 87.5 | 48
2022 | 5 | 14 | 71.43 | 7
2022 | 5 | 15 | 10.34 | 29
2022 | 5 | 16 | 80 | 10
2022 | 5 | 17 | 100 | 2
2022 | 5 | 18 | 0 | 13
2022 | 5 | 19 | 100 | 1
2022 | 5 | 20 | 0 | 2
Looking at all jobs, there where only 5 jobs with VNC issues in the last three days and only one of them was an actual timeout:
penqa=> select id, t_finished, test, (select value from job_settings where job_id = jobs.id and key = 'BACKEND') as backend, arch, reason from jobs where reason like '%Error connecting to VNC server%' order by t_finished desc limit 10;
id | t_finished | test | backend | arch | reason
---------+---------------------+-----------------------------------------------------------------------------------------------------------+---------+--------+----------------------------------------------------------------------------------------------------------------------------
8793234 | 2022-05-20 10:04:47 | jeos-containers:investigate:last_good_tests_and_build:1d3a5e7f5d2aa21788b5aacf7d2bde6af6fcbd78+20220518-1 | svirt | x86_64 | backend died: Error connecting to VNC server <openqaw5-xen.qa.suse.de:5903>: IO::Socket::INET: connect: Connection refused
8793855 | 2022-05-20 10:04:42 | jeos-containers | svirt | x86_64 | backend died: Error connecting to VNC server <openqaw5-xen.qa.suse.de:5901>: IO::Socket::INET: connect: Connection refused
8787574 | 2022-05-19 08:48:28 | offline_sles12sp3_ltss_rmt_pcm-contm_def_full | svirt | s390x | backend died: Error connecting to VNC server <10.161.145.86:5901>: IO::Socket::INET: connect: timeout
8773498 | 2022-05-18 01:28:07 | jeos-containers | svirt | x86_64 | backend died: Error connecting to VNC server <openqaw5-xen.qa.suse.de:5901>: IO::Socket::INET: connect: Connection refused
8773500 | 2022-05-18 01:27:40 | jeos-filesystem | svirt | x86_64 | backend died: Error connecting to VNC server <openqaw5-xen.qa.suse.de:5903>: IO::Socket::INET: connect: Connection refused
8774388 | 2022-05-17 23:10:07 | qam-minimal+base | svirt | s390x | backend died: Error connecting to VNC server <10.161.145.92:5901>: IO::Socket::INET: connect: Connection timed out
8771988 | 2022-05-17 15:08:36 | installation_sles11sp4_sles15SP3_sles15sp4_zVM_ph0 | s390x | s390x | backend died: Error connecting to VNC server <s390qa102.qa.suse.de:5901>: IO::Socket::INET: connect: timeout
8768593 | 2022-05-16 23:11:32 | qam-minimal+base | svirt | s390x | backend done: Error connecting to VNC server <10.161.145.96:5901>: IO::Socket::INET: connect: Connection timed out
8768591 | 2022-05-16 23:10:48 | mru-install-minimal-with-addons | svirt | s390x | backend died: Error connecting to VNC server <10.161.145.93:5901>: IO::Socket::INET: connect: Connection timed out
8769221 | 2022-05-16 21:08:11 | qam-minimal-full | s390x | s390x | backend done: Error connecting to VNC server <s390qa103.qa.suse.de:5901>: IO::Socket::INET: connect: timeout
(10 Zeilen)
In https://openqa.suse.de/tests/8793234/logfile?filename=autoinst-log.txt we can see that the increased timeout of 4 minutes is effective. I doubt that extending the timeout further would be helpful (the error here is "Connection refused" so the VNC server was likely not running anymore). I suppose this is actually a different issue (and we have #76813 for that).
https://openqa.suse.de/tests/8787574/logfile?filename=autoinst-log.txt ran before increasing the VNC timeout any it might have helped:
[2022-05-19T10:48:11.285168+02:00] [info] ::: backend::baseclass::die_handler: Backend process died, backend errors are reported below in the following lines:
Error connecting to VNC server <10.161.145.86:5901>: IO::Socket::INET: connect: timeout
[2022-05-19T10:48:11.285945+02:00] [debug] Closing SSH serial connection with s390zp18.suse.de
I'll check again next week.
Updated by okurz over 2 years ago
Please keep in mind that SLE maintenance tests run much less s390x jobs and SLE validation for new builds has slowed down significantly after SLE15SP4 GMC release so likely you are better off with explicitly triggered tests to get a good statistic.
Updated by mkittler over 2 years ago
Ok, I've been closing the last 100 passing s390x jobs on OSD: https://openqa.suse.de/tests/overview?build=test-s390x
Let's see how high the error rate will be.
Updated by mkittler over 2 years ago
Most of the jobs I've spawned were obsoleted:
openqa=> with test_jobs as (select distinct id, state, result from jobs where build = 'test-s390x') select state, result, count(id) * 100. / (select count(id) from test_jobs) as ratio from test_jobs group by test_jobs.state, test_jobs.result order by ratio desc;
state | result | ratio
-----------+-----------------+------------------------
cancelled | obsoleted | 76.4705882352941176
done | passed | 16.6666666666666667
done | failed | 3.9215686274509804
done | parallel_failed | 1.9607843137254902
done | softfailed | 0.98039215686274509804
(5 Zeilen)
However, the jobs that ran mostly passed and the few failures were not due to VNC connection issues. In fact, none of the > 1200 s390x jobs ran into the VNC error:
openqa=> with finished as (select reason, t_finished from jobs where arch='s390x') select (extract(YEAR from t_finished)) as year, (extract(MONTH from t_finished)) as month, (extract(DAY from t_finished)) as day, round(count(*) filter (where reason like '%Error connecting to VNC%') * 100. / count(*), 2)::numeric(5,2)::float as ratio_of_all_vnc_connect_errors, count(*) total from finished where t_finished >= '2022-05-19' group by year, month, day order by year, month, day asc;
year | month | day | ratio_of_all_vnc_connect_errors | total
------+-------+-----+---------------------------------+-------
2022 | 5 | 19 | 0.22 | 457
2022 | 5 | 20 | 0 | 570
2022 | 5 | 21 | 0 | 508
2022 | 5 | 22 | 0 | 102
2022 | 5 | 23 | 0 | 98
(5 Zeilen)
So I would conclude that the issue isn't very apparent anymore.
Updated by mkittler over 2 years ago
- Status changed from In Progress to Feedback
- Priority changed from Urgent to High
Updated by okurz over 2 years ago
mkittler wrote:
So I would conclude that the issue isn't very apparent anymore.
We wouldn't gain a lot if the issue just become less probable because that would also reproducing and fixing harder. So we should be sure that the same issue realistically can not happen anymore. Of course connections could still time out but if that happens then it should be really clear to the test reviewers what to do. And they should not feel like they need to reopen this ticket or (even worse) just manually restart because they assume it's basically still the same issue.
Updated by mkittler over 2 years ago
- Due date changed from 2022-06-02 to 2022-06-17
I'll be on vacation as of tomorrow. I'll check statistics again when I'm back.
Updated by mkittler over 2 years ago
So far it already looks good. Not a single more job with Error connecting to VNC
.
Updated by mkittler over 2 years ago
We had some more VNC issues but the ratio of them is quite low:
openqa=> with finished as (select reason, t_finished from jobs where arch='s390x') select (extract(YEAR from t_finished)) as year, (extract(MONTH from t_finished)) as month, (extract(DAY from t_finished)) as day, round(count(*) filter (where reason like '%Error connecting to VNC%') * 100. / count(*), 2)::numeric(5,2)::float as ratio_of_all_vnc_connect_errors, count(*) total from finished where t_finished >= '2022-05-19' group by year, month, day order by year, month, day asc;
year | month | day | ratio_of_all_vnc_connect_errors | total
------+-------+-----+---------------------------------+-------
2022 | 5 | 19 | 0.22 | 446
2022 | 5 | 20 | 0 | 564
2022 | 5 | 21 | 0 | 499
2022 | 5 | 22 | 0 | 96
2022 | 5 | 23 | 0 | 372
2022 | 5 | 24 | 0 | 229
2022 | 5 | 25 | 0 | 383
2022 | 5 | 26 | 0 | 340
2022 | 5 | 27 | 0 | 339
2022 | 5 | 28 | 0 | 180
2022 | 5 | 29 | 0 | 283
2022 | 5 | 30 | 0 | 483
2022 | 5 | 31 | 0 | 437
2022 | 6 | 1 | 0 | 517
2022 | 6 | 2 | 0 | 370
2022 | 6 | 3 | 0.13 | 747
2022 | 6 | 4 | 0 | 356
2022 | 6 | 5 | 0.24 | 410
2022 | 6 | 6 | 0 | 261
2022 | 6 | 7 | 0 | 364
2022 | 6 | 8 | 0 | 447
2022 | 6 | 9 | 0 | 467
(22 Zeilen)
Those are the relevant jobs:
openqa=> select id, t_finished, test, (select value from job_settings where job_id = jobs.id and key = 'BACKEND') as backend, arch, reason from jobs where reason like '%Error connecting to VNC server%' order by t_finished desc limit 10;
id | t_finished | test | backend | arch | reason
---------+---------------------+-----------------------------------------------------------------------------------------------------------+---------+--------+----------------------------------------------------------------------------------------------------------------------------
8888867 | 2022-06-05 10:49:20 | sle-micro_autoyast | svirt | s390x | backend died: Error connecting to VNC server <10.161.145.83:5901>: IO::Socket::INET: connect: timeout
8871077 | 2022-06-03 09:37:44 | qam-minimal-full | s390x | s390x | backend died: Error connecting to VNC server <s390qa105.qa.suse.de:5901>: IO::Socket::INET: connect: timeout
8870924 | 2022-06-03 08:42:54 | jeos-containers | svirt | x86_64 | backend died: Error connecting to VNC server <openqaw5-xen.qa.suse.de:5901>: IO::Socket::INET: connect: Connection refused
8870930 | 2022-06-03 08:42:53 | jeos-fips | svirt | x86_64 | backend died: Error connecting to VNC server <openqaw5-xen.qa.suse.de:5902>: IO::Socket::INET: connect: Connection refused
8830494 | 2022-05-27 05:59:24 | jeos-fs_stress | svirt | x86_64 | backend died: Error connecting to VNC server <openqaw5-xen.qa.suse.de:5902>: IO::Socket::INET: connect: Connection refused
8830463 | 2022-05-27 05:59:19 | jeos-containers | svirt | x86_64 | backend died: Error connecting to VNC server <openqaw5-xen.qa.suse.de:5903>: IO::Socket::INET: connect: Connection refused
8793234 | 2022-05-20 10:04:47 | jeos-containers:investigate:last_good_tests_and_build:1d3a5e7f5d2aa21788b5aacf7d2bde6af6fcbd78+20220518-1 | svirt | x86_64 | backend died: Error connecting to VNC server <openqaw5-xen.qa.suse.de:5903>: IO::Socket::INET: connect: Connection refused
8793855 | 2022-05-20 10:04:42 | jeos-containers | svirt | x86_64 | backend died: Error connecting to VNC server <openqaw5-xen.qa.suse.de:5901>: IO::Socket::INET: connect: Connection refused
8787574 | 2022-05-19 08:48:28 | offline_sles12sp3_ltss_rmt_pcm-contm_def_full | svirt | s390x | backend died: Error connecting to VNC server <10.161.145.86:5901>: IO::Socket::INET: connect: timeout
8773498 | 2022-05-18 01:28:07 | jeos-containers | svirt | x86_64 | backend died: Error connecting to VNC server <openqaw5-xen.qa.suse.de:5901>: IO::Socket::INET: connect: Connection refused
(10 Zeilen)
Job 8888867 actually shows a VNC-related error during the installation (see https://openqa.suse.de/tests/8888867#step/installation/2) and that error is not present in the previous run (see https://openqa.suse.de/tests/8909978#step/installation/3) so it might actually be a product bug.
The logs of job 8871077 show that the timeout and re-connect attempts work as expected but were still not enough (so presumably the VNC server got stuck/stopped/crashed):
[2022-06-03T11:33:36.162953+02:00] [warn] !!! consoles::VNC::update_framebuffer: Error in VNC protocol - relogin: short read for zrle data 991 - 2244
[2022-06-03T11:34:38.221298+02:00] [warn] !!! consoles::VNC::login: Error connecting to VNC server <s390qa105.qa.suse.de:5901>: IO::Socket::INET: connect: timeout
[2022-06-03T11:35:39.273462+02:00] [warn] !!! consoles::VNC::login: Error connecting to VNC server <s390qa105.qa.suse.de:5901>: IO::Socket::INET: connect: timeout
[2022-06-03T11:36:40.319105+02:00] [warn] !!! consoles::VNC::login: Error connecting to VNC server <s390qa105.qa.suse.de:5901>: IO::Socket::INET: connect: timeout
[2022-06-03T11:37:41.382654+02:00] [info] ::: backend::baseclass::die_handler: Backend process died, backend errors are reported below in the following lines:
Error connecting to VNC server <s390qa105.qa.suse.de:5901>: IO::Socket::INET: connect: timeout
The same counts for 8870924 and the further jobs. Not sure whether we can do much about it and why the VNC server on the other end doesn't work anymore. I'm also wondering what the difference between the timeouts and the refused connections is. Maybe in case of the timeouts the VNC server on the other end is technically still listening but stuck? Looks like s390x jobs are normally running into the timeout and x86_64 jobs into connection refused. So those are likely distinct problems.
Updated by mkittler about 2 years ago
Only 3 new jobs have failed for the last 4 days (https://openqa.suse.de/tests/8939209, https://openqa.suse.de/tests/8928450, https://openqa.suse.de/tests/8928436). That's still a small percentage. I'm wondering whether that's already acceptable. If not I suppose I'd try to debug the VNC servers themselves because the retry on the client-side should already cover client or connection problems. I just have no idea how to debug the VNC servers. (The svirt/s390x and svirt/x86_64 cases seem to be different as there's also a slightly different error message and the VNC server is maybe provided differently.)
Updated by okurz about 2 years ago
mkittler wrote:
Only 3 new jobs have failed for the last 4 days (https://openqa.suse.de/tests/8939209, https://openqa.suse.de/tests/8928450, https://openqa.suse.de/tests/8928436). That's still a small percentage. I'm wondering whether that's already acceptable.
Just put yourself into the shoes of the (poor) openQA test maintainers or maybe a package maintainer that learns that their package update is not accepted due to any of the above test failures. What would you do? Who should handle such failures?
Updated by mkittler about 2 years ago
The only thing I can improve right now is having those jobs restarted automatically: https://github.com/os-autoinst/openQA/pull/4702
I actually thought it would already be the case the the current regex isn't matching the remaining cases anymore.
Updated by okurz about 2 years ago
Updated by okurz about 2 years ago
- Due date deleted (
2022-06-17) - Status changed from Feedback to Resolved
With that merged and deployed and openqa-query-for-job-label poo#99345
only showing
8871077|2022-06-03 09:37:44|done|incomplete|qam-minimal-full|backend died: Error connecting to VNC server <s390qa105.qa.suse.de:5901>: IO::Socket::INET: connect: timeout|openqaworker2
8769221|2022-05-16 21:08:11|done|failed|qam-minimal-full|backend done: Error connecting to VNC server <s390qa103.qa.suse.de:5901>: IO::Socket::INET: connect: timeout|openqaworker2
we assume that this is resolved by now even though we did not dig deeper into VNC server debugging. Anyway, for new specific problem observations please open a separate, dedicate ticket