Parent task: action #107062: Multiple failures due to network issues

[tools][qem] Incomplete test runs on s390x with auto_review:"backend died: Error connecting to VNC server.*s390.*Connection timed out":retry size:M

Added by vsvecova about 3 years ago. Updated over 2 years ago.

Status: Resolved
Priority: High
Assignee:
Category: Bugs in existing tests
Start date: 2021-09-27
Due date:
% Done: 0%
Estimated time:
Difficulty:

Description

Observation

openQA test in scenario sle-12-SP4-Server-DVD-Updates-s390x-mru-install-minimal-with-addons@s390x-kvm-sle12 incomplete, stops at start_install

Test suite description

Testsuite maintained at https://gitlab.suse.de/qa-maintenance/qam-openqa-yml.

Reproducible

Fails since (at least) Build 20210927-1 (current job)

Find jobs referencing this ticket with the help of https://raw.githubusercontent.com/os-autoinst/scripts/master/openqa-query-for-job-label, e.g.
openqa-query-for-job-label poo#99345

Expected result

Last good: 20210925-1 (or more recent)

Acceptance criteria

  • AC1: The root cause of the problem is known
  • AC2: The next steps are known and have been initiated

Suggestions

  • Talk to all the people involved to get the full story

Further details

Always latest result in this scenario: latest

IPMI and s390x workers keep losing the VNC connection during SLES installation and the reconnect attempt gets stuck for some strange reason until the job hits MAX_JOB_TIME:

[2022-05-11T13:46:24.602260+02:00] [debug] <<< testapi::wait_screen_change(timeout=10, similarity_level=50)
XIO:  fatal IO error 4 (Interrupted system call) on X server ":37191"
      after 23426 requests (23426 known processed) with 0 events remaining.
XIO:  fatal IO error 11 (Resource temporarily unavailable) on X server ":34867"
      after 28852 requests (28852 known processed) with 0 events remaining.
[2022-05-11T15:36:53.708309+02:00] [debug] autotest received signal TERM, saving results of current test before exiting
[2022-05-11T15:36:53.708518+02:00] [debug] isotovideo received signal TERM
[2022-05-11T15:36:53.708516+02:00] [debug] backend got TERM

Note that the job spent 110 minutes in wait_screen_change() that was supposed to time out after 10 seconds.
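
Such hangs suggest that the timeout is not enforced while the backend is blocked on the VNC socket (see the related ticket #111004). As a minimal illustration only, with a hypothetical helper name and not the actual os-autoinst code, a read bounded by an absolute deadline could look like this:

use IO::Select;

# Hypothetical helper: read $length bytes from the VNC socket, but give up once the
# absolute deadline has passed instead of blocking indefinitely on a dead connection.
sub read_with_deadline {
    my ($socket, $length, $deadline) = @_;
    my $select = IO::Select->new($socket);
    my $buffer = '';
    while (length($buffer) < $length) {
        my $remaining = $deadline - time;
        die "VNC read timed out\n" if $remaining <= 0;
        my @ready = $select->can_read($remaining);    # wait at most until the deadline
        next unless @ready;
        my $read = sysread($socket, my $chunk, $length - length($buffer));
        die "VNC connection closed\n" unless defined $read && $read > 0;
        $buffer .= $chunk;
    }
    return $buffer;
}

# e.g. read_with_deadline($vnc_socket, 4, time + 10) for a 10-second budget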

In another job it was stuck on assert_screen:

[2022-05-11T15:58:59.169408+02:00] [debug] <<< testapi::assert_screen(mustmatch="installation", no_wait=1, timeout=30)
[2022-05-11T16:15:16.486959+02:00] [warn] !!! consoles::VNC::catch {...} : Error in VNC protocol - relogin: short read for zrle data 659 - 950
[2022-05-11T21:47:05.966962+02:00] [debug] backend got TERM
[2022-05-11T21:47:05.966980+02:00] [debug] isotovideo received signal TERM
[2022-05-11T21:47:05.967084+02:00] [debug] autotest received signal TERM, saving results of current test before exiting
XIO:  fatal IO error 11 (Resource temporarily unavailable) on X server ":44341"
      after 28689 requests (28689 known processed) with 0 events remaining.
XIO:  fatal IO error 11 (Resource temporarily unavailable) on X server ":60785"
      after 39307 requests (39307 known processed) with 0 events remaining.

Here we even got a VNC error and the VNC client would try to re-login, but I suppose that is pointless because the VNC server terminates when the connection is lost anyway. So for a real retry we would likely need to also restart the VNC server. (Note that @MDoucha tried to connect manually here, see https://progress.opensuse.org/issues/99345#note-9)


Related issues 5 (1 open, 4 closed)

Related to openQA Project (public) - action #76813: [tools] Test using svirt backend fails with auto_review:"Error connecting to VNC server.*: IO::Socket::INET: connect: Connection refused" (New, 2020-10-30)
Related to openQA Project (public) - action #111004: Timeout of test API functions not enforced if backend gets stuck, e.g. on the VNC socket size:M (Resolved, mkittler, 2022-05-12, 2022-05-28)
Related to openQA Project (public) - coordination #109656: [epic] Stable non-qemu backends (Resolved, okurz, 2021-12-29)
Related to openQA Infrastructure (public) - action #111063: Ping monitoring for our s390z mainframes size:S (Resolved, okurz, 2022-05-13)
Has duplicate openQA Tests (public) - action #110902: [qe-core]qam-minimal-full@s390x-zVM-vswitch-l3 multiple failures (Rejected, mgrifalconi)

Actions #1

Updated by livdywan almost 3 years ago

  • Is duplicate of action #76813: [tools] Test using svirt backend fails with auto_review:"Error connecting to VNC server.*: IO::Socket::INET: connect: Connection refused" added
Actions #2

Updated by okurz almost 3 years ago

  • Subject changed from [qem] Incomplete test runs on s390x with `backend died: Error connecting to VNC server <10.161.145.95:5901>: IO::Socket::INET: connect: Connection timed out` to [tools][qem] Incomplete test runs on s390x with `backend died: Error connecting to VNC server <10.161.145.95:5901>: IO::Socket::INET: connect: Connection timed out`
  • Target version set to Ready
Actions #3

Updated by okurz almost 3 years ago

  • Is duplicate of deleted (action #76813: [tools] Test using svirt backend fails with auto_review:"Error connecting to VNC server.*: IO::Socket::INET: connect: Connection refused")
Actions #4

Updated by okurz almost 3 years ago

  • Related to action #76813: [tools] Test using svirt backend fails with auto_review:"Error connecting to VNC server.*: IO::Socket::INET: connect: Connection refused" added
Actions #5

Updated by okurz almost 3 years ago

  • Status changed from New to Blocked
  • Assignee set to mkittler
Actions #6

Updated by okurz over 2 years ago

  • Subject changed from [tools][qem] Incomplete test runs on s390x with `backend died: Error connecting to VNC server <10.161.145.95:5901>: IO::Socket::INET: connect: Connection timed out` to [tools][qem] Incomplete test runs on s390x with auto_review:"backend died: Error connecting to VNC server.*s390.*Connection timed out":retry
  • Description updated (diff)
  • Priority changed from Normal to High

I called export host=openqa.suse.de; bash -ex ./openqa-monitor-incompletes | bash -e ./openqa-label-known-issues for #108896 and found 63 unknown issues, most of them about "Error connecting to VNC server.*s390", so I bumped the priority, adapted the subject to act as auto-review control words and added the template text using openqa-query-for-job-label. Now I will call the above openqa-label-known-issues command again.
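
For illustration, the control words added to the subject form a Perl regular expression that the auto-review/label tooling matches against a job's reason before labeling it with this ticket and retrying it. A minimal sketch of that matching under assumed semantics (not the actual openqa-label-known-issues code):

# auto_review regex from the ticket subject vs. a typical job reason from this ticket
my $regex  = qr/backend died: Error connecting to VNC server.*s390.*Connection timed out/;
my $reason = 'backend died: Error connecting to VNC server <s390qa101.qa.suse.de:5901>: '
           . 'IO::Socket::INET: connect: Connection timed out';
print "label with poo#99345 and retry\n" if $reason =~ $regex;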

Actions #7

Updated by okurz over 2 years ago

  • Parent task set to #107062
Actions #8

Updated by okurz over 2 years ago

  • Priority changed from High to Low

We decided to work on #107026 first, so I will change the priority of this ticket accordingly.

Actions #9

Updated by MDoucha over 2 years ago

This issue is currently blocking auto-approval of nearly all SLE-12SP5 maintenance updates. See https://openqa.suse.de/tests/8735098#next_previous

IPMI and s390x workers keep losing the VNC connection during SLES installation and the reconnect attempt gets stuck for some strange reason until the job hits MAX_JOB_TIME. The logs are not very helpful; reconnect logging could certainly be improved for a start:
https://openqa.suse.de/tests/8735098/logfile?filename=autoinst-log.txt

[2022-05-11T13:46:24.602260+02:00] [debug] <<< testapi::wait_screen_change(timeout=10, similarity_level=50)
XIO:  fatal IO error 4 (Interrupted system call) on X server ":37191"
      after 23426 requests (23426 known processed) with 0 events remaining.
XIO:  fatal IO error 11 (Resource temporarily unavailable) on X server ":34867"
      after 28852 requests (28852 known processed) with 0 events remaining.
[2022-05-11T15:36:53.708309+02:00] [debug] autotest received signal TERM, saving results of current test before exiting
[2022-05-11T15:36:53.708518+02:00] [debug] isotovideo received signal TERM
[2022-05-11T15:36:53.708516+02:00] [debug] backend got TERM

Note that the job spent 110 minutes in wait_screen_change() that was supposed to time out after 10 seconds.

I've checked the job groups of other SLE versions and it seems that only SLE-12SP5 is affected. I've also tried connecting via VNC to one running job that was stuck for too long and the VNC server was running and responding. It was this job, specifically: http://openqa.qam.suse.cz/tests/42502

Actions #10

Updated by okurz over 2 years ago

  • Status changed from Blocked to New
  • Priority changed from Low to Urgent

#107026 was resolved, so now is a good time to look into this with urgency.

Actions #11

Updated by mgrifalconi over 2 years ago

  • Has duplicate action #110902: [qe-core]qam-minimal-full@s390x-zVM-vswitch-l3 multiple failures added
Actions #12

Updated by livdywan over 2 years ago

  • Related to action #111004: Timeout of test API functions not enforced if backend gets stuck, e.g. on the VNC socket size:M added
Actions #13

Updated by okurz over 2 years ago

Actions #14

Updated by livdywan over 2 years ago

  • Subject changed from [tools][qem] Incomplete test runs on s390x with auto_review:"backend died: Error connecting to VNC server.*s390.*Connection timed out":retry to [tools][qem] Incomplete test runs on s390x with auto_review:"backend died: Error connecting to VNC server.*s390.*Connection timed out":retry size:M
  • Description updated (diff)
  • Status changed from New to Workable
Actions #15

Updated by mkittler over 2 years ago

We'll look into the wait_screen_change issue separately (see #111004). It is still not clear what the source of the initially reported issue is; it also happens under SLE-12-SP4.


However, it is definitely still happening (as well as #76813):

openqa=> select id, t_finished, test, (select value from job_settings where job_id = jobs.id and key = 'BACKEND') as backend, arch, reason from jobs where reason like '%Error connecting to VNC server%' order by id desc limit 40;
   id    |     t_finished      |                              test                               | backend |  arch  |                                                           reason                                                           
---------+---------------------+-----------------------------------------------------------------+---------+--------+----------------------------------------------------------------------------------------------------------------------------
 8742546 | 2022-05-12 07:33:59 | mru-install-minimal-with-addons                                 | svirt   | s390x  | backend died: Error connecting to VNC server <10.161.145.81:5901>: IO::Socket::INET: connect: Connection timed out
 8742530 | 2022-05-12 07:26:57 | offline_sles12sp5_pscc_asmm-lgm_all_full_zVM_preparation_auto_1 | s390x   | s390x  | backend died: Error connecting to VNC server <s390qa101.qa.suse.de:5901>: IO::Socket::INET: connect: Connection timed out
 8742413 | 2022-05-12 06:54:57 | offline_sles12sp5_pscc_asmm-lgm_all_full_zVM_preparation_auto_1 | s390x   | s390x  | backend died: Error connecting to VNC server <s390qa102.qa.suse.de:5901>: IO::Socket::INET: connect: Connection timed out
 8741962 | 2022-05-12 06:54:54 | offline_sles12sp5_pscc_asmm-lgm_all_full_zVM_preparation        | s390x   | s390x  | backend died: Error connecting to VNC server <s390qa101.qa.suse.de:5901>: IO::Socket::INET: connect: Connection timed out
 8741951 | 2022-05-12 06:21:37 | offline_sles12sp5_pscc_asmm-lgm_all_full_zVM_preparation_auto_1 | s390x   | s390x  | backend died: Error connecting to VNC server <s390qa102.qa.suse.de:5901>: IO::Socket::INET: connect: Connection timed out
 8741950 | 2022-05-12 06:19:59 | offline_sles12sp5_pscc_asmm-lgm_all_full_zVM_ssh_pre            | s390x   | s390x  | backend died: Error connecting to VNC server <s390qa101.qa.suse.de:5901>: IO::Socket::INET: connect: Connection timed out
 8741935 | 2022-05-12 05:47:10 | offline_sles12sp5_pscc_asmm-lgm_all_full_zVM_preparation_auto_1 | s390x   | s390x  | backend died: Error connecting to VNC server <s390qa102.qa.suse.de:5901>: IO::Socket::INET: connect: Connection timed out
 8738659 | 2022-05-11 23:10:32 | offline_sles12sp5_pscc_asmm-lgm_all_full_zVM_preparation        | s390x   | s390x  | backend died: Error connecting to VNC server <s390qa102.qa.suse.de:5901>: IO::Socket::INET: connect: Connection timed out
 8738614 | 2022-05-11 21:24:07 | offline_sles12sp5_pscc_asmm-lgm_all_full_zVM_ssh_pre            | s390x   | s390x  | backend died: Error connecting to VNC server <s390qa101.qa.suse.de:5901>: IO::Socket::INET: connect: Connection timed out
 8736240 | 2022-05-11 20:30:21 | jeos-main                                                       | svirt   | x86_64 | backend died: Error connecting to VNC server <openqaw5-xen.qa.suse.de:5901>: IO::Socket::INET: connect: Connection refused
 8736238 | 2022-05-11 20:30:54 | jeos-extratest                                                  | svirt   | x86_64 | backend died: Error connecting to VNC server <openqaw5-xen.qa.suse.de:5902>: IO::Socket::INET: connect: Connection refused
 8733540 | 2022-05-11 01:36:13 | installation_sles11sp4_sles15SP3_sles15sp4_zVM_ph0              | s390x   | s390x  | backend died: Error connecting to VNC server <s390qa102.qa.suse.de:5901>: IO::Socket::INET: connect: Connection timed out
 8733519 | 2022-05-11 01:12:48 | installation_sles11sp4_sles15SP3_sles15sp4_zVM_ph0              | s390x   | s390x  | backend died: Error connecting to VNC server <s390qa102.qa.suse.de:5901>: IO::Socket::INET: connect: Connection timed out
 8733486 | 2022-05-11 00:47:43 | installation_sles11sp4_sles15SP3_sles15sp4_zVM_ph0              | s390x   | s390x  | backend died: Error connecting to VNC server <s390qa102.qa.suse.de:5901>: IO::Socket::INET: connect: Connection timed out
 8733444 | 2022-05-11 00:23:46 | installation_sles11sp4_sles15SP3_sles15sp4_zVM_ph0              | s390x   | s390x  | backend died: Error connecting to VNC server <s390qa102.qa.suse.de:5901>: IO::Socket::INET: connect: Connection timed out
 8733433 | 2022-05-10 23:57:43 | installation_sles11sp4_sles15SP3_sles15sp4_zVM_ph0              | s390x   | s390x  | backend died: Error connecting to VNC server <s390qa102.qa.suse.de:5901>: IO::Socket::INET: connect: Connection timed out
 8733370 | 2022-05-10 23:35:48 | installation_sles11sp4_sles15SP3_sles15sp4_zVM_ph0              | s390x   | s390x  | backend died: Error connecting to VNC server <s390qa102.qa.suse.de:5901>: IO::Socket::INET: connect: Connection timed out
 8733348 | 2022-05-10 23:12:18 | installation_sles11sp4_sles15SP3_sles15sp4_zVM_ph0              | s390x   | s390x  | backend died: Error connecting to VNC server <s390qa102.qa.suse.de:5901>: IO::Socket::INET: connect: Connection timed out
 8733333 | 2022-05-10 22:48:36 | installation_sles11sp4_sles15SP3_sles15sp4_zVM_ph0              | s390x   | s390x  | backend died: Error connecting to VNC server <s390qa101.qa.suse.de:5901>: IO::Socket::INET: connect: Connection timed out
 8733266 | 2022-05-10 22:25:36 | installation_sles11sp4_sles15SP3_sles15sp4_zVM_ph0              | s390x   | s390x  | backend died: Error connecting to VNC server <s390qa102.qa.suse.de:5901>: IO::Socket::INET: connect: Connection timed out
 8733151 | 2022-05-10 22:03:57 | installation_sles11sp4_sles15SP3_sles15sp4_zVM_ph0              | s390x   | s390x  | backend died: Error connecting to VNC server <s390qa101.qa.suse.de:5901>: IO::Socket::INET: connect: Connection timed out
 8733143 | 2022-05-10 21:41:59 | installation_sles11sp4_sles15SP3_sles15sp4_zVM_ph0              | s390x   | s390x  | backend died: Error connecting to VNC server <s390qa102.qa.suse.de:5901>: IO::Socket::INET: connect: Connection timed out
 8733101 | 2022-05-10 21:20:02 | installation_sles11sp4_sles15SP3_sles15sp4_zVM_ph0              | s390x   | s390x  | backend died: Error connecting to VNC server <s390qa102.qa.suse.de:5901>: IO::Socket::INET: connect: Connection timed out
 8733080 | 2022-05-10 20:55:57 | installation_sles11sp4_sles15SP3_sles15sp4_zVM_ph0              | s390x   | s390x  | backend died: Error connecting to VNC server <s390qa102.qa.suse.de:5901>: IO::Socket::INET: connect: Connection timed out
 8732193 | 2022-05-10 20:33:46 | installation_sles11sp4_sles15SP3_sles15sp4_zVM_ph0              | s390x   | s390x  | backend died: Error connecting to VNC server <s390qa102.qa.suse.de:5901>: IO::Socket::INET: connect: Connection timed out
 8731378 | 2022-05-10 20:11:57 | installation_sles11sp4_sles15SP3_sles15sp4_zVM_ph0              | s390x   | s390x  | backend died: Error connecting to VNC server <s390qa102.qa.suse.de:5901>: IO::Socket::INET: connect: Connection timed out
…

The overall fail ratio of s390x jobs (starting from this year) is also still quite high although the cases where it times out seem to be rather rare:

openqa=> with test_jobs as (select distinct id, result from jobs where arch = 's390x' and t_started >= '2022-01-01 00:00:0') select result, count(id) * 100. / (select count(id) from test_jobs) as ratio from test_jobs group by test_jobs.result order by ratio desc;
       result       |         ratio          
--------------------+------------------------
 passed             |    55.1260314230812705
 failed             |    21.7220526732225613
 softfailed         |    14.3862326212275348
 incomplete         |     6.7395727365208545
 timeout_exceeded   | 0.79970611506725443653
 parallel_failed    | 0.64993783203345766927
 obsoleted          | 0.25714931615236803436
 user_cancelled     | 0.14129083305075166723
 user_restarted     | 0.10172939979654120041
 none               | 0.05369051655928563355
 parallel_restarted | 0.02260653328812026676
(11 rows)

The figures also haven't changed much compared to jobs from all time and compared to jobs from the last month and the last week.

The ratio of VNC issues (at least those mentioned in the ticket description where the job's reason is set accordingly) within failing/incompleting s390x jobs is indeed quite high:

openqa=> with finished as (select result, reason, t_finished from jobs where arch='s390x' and (result='failed' or result='incomplete')) select round(count(*) filter (where reason like '%Error connecting to VNC server%') * 100. / count(*), 2)::numeric(5,2)::float as ratio_of_vnc_issues, count(*) total from finished where t_finished >= '2022-01-01';
 ratio_of_vnc_issues | total 
---------------------+-------
               15.38 | 10105
(1 row)

Note that "total" is the number of failing/incompleting s390x jobs and the ratio is the ratio of VNC issues within that set of job (not within all s390x jobs).

The problem has become much worse since February 2022:

openqa=> with finished as (select result, reason, t_finished from jobs where arch='s390x' and (result='failed' or result='incomplete')) select (extract(YEAR from t_finished)) as year, (extract(MONTH from t_finished)) as month, round(count(*) filter (where reason like '%Error connecting to VNC server%') * 100. / count(*), 2)::numeric(5,2)::float as ratio_of_vnc_issues, count(*) total from finished where t_finished >= '2021-01-01' group by year, month order by year, month asc;
 year | month | ratio_of_vnc_issues | total 
------+-------+---------------------+-------
 2021 |     1 |                2.41 |   249
 2021 |     2 |                0.99 |   304
 2021 |     3 |                2.97 |   202
 2021 |     4 |                2.27 |   264
 2021 |     5 |                0.92 |   109
 2021 |     6 |                0.93 |   108
 2021 |     7 |                   0 |    55
 2021 |     8 |                   0 |    26
 2021 |     9 |                 0.6 |   168
 2021 |    10 |                   0 |   140
 2021 |    11 |                   0 |   480
 2021 |    12 |                   0 |   208
 2022 |     1 |                0.22 |   455
 2022 |     2 |               10.89 |   937
 2022 |     3 |               15.16 |  2282
 2022 |     4 |               20.66 |  4429
 2022 |     5 |                9.49 |  2002
(17 rows)

Note that the overall ratio of failed/incomplete jobs hasn't become much worse, though:

openqa=> with finished as (select result, t_finished from jobs where arch='s390x') select (extract(YEAR from t_finished)) as year, (extract(MONTH from t_finished)) as month, round(count(*) filter (where result = 'failed' or result = 'incomplete') * 100. / count(*), 2)::numeric(5,2)::float as ratio_of_all_failures_or_incompletes, count(*) total from finished where t_finished >= '2020-01-01' group by year, month order by year, month asc;
 year | month | ratio_of_all_failures_or_incompletes | total 
------+-------+--------------------------------------+-------
 2020 |     1 |                                49.39 |   733
 2020 |     2 |                                   80 |    30
 2020 |     3 |                                31.71 |   820
 2020 |     4 |                                26.91 |   680
 2020 |     5 |                                31.01 |   703
 2020 |     6 |                                33.33 |   516
 2020 |     7 |                                29.29 |   140
 2020 |     8 |                                30.95 |    84
 2020 |     9 |                                41.15 |   418
 2020 |    10 |                                30.73 |   563
 2020 |    11 |                                35.29 |   544
 2020 |    12 |                                30.97 |   452
 2021 |     1 |                                27.76 |   897
 2021 |     2 |                                 31.9 |   953
 2021 |     3 |                                30.75 |   657
 2021 |     4 |                                 26.4 |  1000
 2021 |     5 |                                27.95 |   390
 2021 |     6 |                                51.67 |   209
 2021 |     7 |                                35.95 |   153
 2021 |     8 |                                46.43 |    56
 2021 |     9 |                                36.76 |   457
 2021 |    10 |                                32.26 |   434
 2021 |    11 |                                33.59 |  1429
 2021 |    12 |                                38.66 |   538
 2022 |     1 |                                34.89 |  1304
 2022 |     2 |                                19.01 |  4930
 2022 |     3 |                                19.74 | 11561
 2022 |     4 |                                22.45 | 19731
 2022 |     5 |                                26.77 |  7481

Here "total" is the number of s390x jobs in a particular month. It looks like the number of jobs has more than tripled in February compared to January. Then it doubled and then almost doubled again. So just by looking at these figures one big change since February becomes quite clear: The load in terms of number of jobs has increased a lot.

Note that since April we have also launched a significant number of investigation jobs:

openqa=> with finished as (select result, test, t_finished from jobs where arch='s390x') select (extract(YEAR from t_finished)) as year, (extract(MONTH from t_finished)) as month, round(count(*) filter (where test like '%investigate%') * 100. / count(*), 2)::numeric(5,2)::float as ratio_of_investigate_jobs, count(*) filter (where test like '%investigate%') as investigate_jobs, count(*) total from finished where t_finished >= '2021-01-01' group by year, month order by year, month asc;
 year | month | ratio_of_investigate_jobs | investigate_jobs | total 
------+-------+---------------------------+------------------+-------
 2021 |     1 |                         0 |                0 |   897
 2021 |     2 |                         0 |                0 |   953
 2021 |     3 |                         0 |                0 |   657
 2021 |     4 |                         0 |                0 |  1000
 2021 |     5 |                         0 |                0 |   390
 2021 |     6 |                         0 |                0 |   209
 2021 |     7 |                         0 |                0 |   153
 2021 |     8 |                         0 |                0 |    56
 2021 |     9 |                         0 |                0 |   457
 2021 |    10 |                         0 |                0 |   434
 2021 |    11 |                         0 |                0 |  1429
 2021 |    12 |                         0 |                0 |   538
 2022 |     1 |                         0 |                0 |  1304
 2022 |     2 |                      0.02 |                1 |  4930
 2022 |     3 |                         0 |                0 | 11561
 2022 |     4 |                     17.87 |             3525 | 19731
 2022 |     5 |                     26.11 |             1956 |  7492
(17 rows)

However, that doesn't explain the additional load we've seen since February.

Actions #16

Updated by MDoucha over 2 years ago

mkittler wrote:

Here "total" is the number of s390x jobs in a particular month. It looks like the number of jobs has more than tripled in February compared to January. Then it doubled and then almost doubled again. So just by looking at these figures one big change since February becomes quite clear: The load in terms of number of jobs has increased a lot.

Note that this isn't a strictly s390x issue. I've had similar problems on the Prague IPMI openQA instance, again almost exclusively on SLE-12SP5:
http://openqa.qam.suse.cz/tests/42078
http://openqa.qam.suse.cz/tests/42397
http://openqa.qam.suse.cz/tests/42404
http://openqa.qam.suse.cz/tests/42502

I've omitted a few more jobs which have no useful logs due to my mistake.

The common factor is that the tests use VNC over physical network.

Actions #17

Updated by mkittler over 2 years ago

Note that this isn't a strictly s390x issue.

I know. For the sake of simplicity (and because the initial ticket description says so) I reduced my further queries to s390x. My first query also shows a few svirt/x86_64 jobs.

The common factor is that the tests use VNC over physical network.

Yes, I'd definitely agree with that. So jobs using the backends s390x, ipmi and svirt are mostly affected (and the arch doesn't really matter).

Actions #18

Updated by mkittler over 2 years ago

It looks like the number of jobs has more than tripled in February compared to January. Then it doubled and then almost doubled again. So just by looking at these figures one big change since February becomes quite clear: The load in terms of number of jobs has increased a lot.

That's maybe just due to the cleanup. We simply cleaned up these old jobs, so the figures from previous months are just lower because of that.

Actions #19

Updated by okurz over 2 years ago

I would like to have some kind of monitoring checks between the openQA workers where we run x3270 and the s390x mainframe on which we have s390qa101.qa.suse.de, s390qa102.qa.suse.de, s390qa103.qa.suse.de, s390qa104.qa.suse.de, s390qa105.qa.suse.de and s390qa106.qa.suse.de
My question would be: as the z/VM LPARs(?) are only online while tests are running and only a running OS can respond to pings, can we actually ping the mainframe or something to check the stability of the physical connection between the x86_64 openQA worker machine and the mainframe hosting the SUT VMs?

EDIT: We have discussed this topic in a mob session on 2022-05-12. We could not find good short-term mitigations except for brute-force retrying (a fail ratio of 90% means retrying up to 20 times to be on the safe side). The challenge with brute-force retrying in the recent "timeout_exceeded" examples like https://openqa.suse.de/tests/8735826/logfile?filename=autoinst-log.txt is that right now we think we could only detect the issue based on the string "XIO: fatal IO error 4 (Interrupted system call) on X server", which likely also appears in other cases and potentially even in passed jobs, at least in a similar way. E.g. in a passed job https://openqa.suse.de/tests/8702743/logfile?filename=autoinst-log.txt we have "XIO: fatal IO error 11 (Resource temporarily unavailable)".

In
https://suse.slack.com/archives/C02CANHLANP/p1652363255685769?thread_ts=1652355734.319689&cid=C02CANHLANP
mdoucha identified that "As for Prague IPMI jobs, the first SLE-12SP5 failure of this kind appeared on April 8th. The previous kernel build on March 7 had clean results". We continued with "Because the tests are relying on the VNC server in the SUT, did you check an older version of SLE? Like 12SP5 GM without maintenance updates?". The issue happens in the ISO installer. We don't have a reliable reproducer on an already installed system. AFAIR there is a way to disable the auto-update or point its update repo to a fake place. Since the s390x jobs show plenty of VNC failures before March 7th it's also likely that the issue got revealed by a VNC client update instead. So if the issue is reproducible even on an older SLE12SP5 GM, then the VNC client on the worker host could be the culprit.

So for further steps ahead:

  1. In #111004 we plan to increase unit test coverage for functions like wait_screen_change
  2. For this issue #99345 we should focus on mitigation within the realistic cases, e.g. s390x z/VM. Either take a specific z/VM-bound openQA worker instance out of production and run test cases locally with just isotovideo (utilizing https://github.com/os-autoinst/os-autoinst#how-to-run-test-cases with the VNC client options) or use a local openQA setup, reproduce the issue(s), improve logging in place and find the root cause
  3. Martin Doucha will try an IPMI job with disabled installer auto update. If that fails try to reproduce s390x z/VM jobs with SLE12SP5 GM with no updates applied and auto-updating disabled
  4. Setup a worker with openSUSE Leap 15.2 or 15.3 with VNC related packages as of before 2022-03 to crosscheck if VNC client updates caused a regression
Actions #20

Updated by mgriessmeier over 2 years ago

okurz wrote:

I would like to have some kind of monitoring checks between the openQA workers where we run x3270 and the s390x mainframe on which we have s390qa101.qa.suse.de, s390qa102.qa.suse.de, s390qa103.qa.suse.de, s390qa104.qa.suse.de, s390qa105.qa.suse.de and s390qa106.qa.suse.de

Hmm I will think about possible solutions to that.

My question would be: as the z/VM LPARs(?) are only online while tests are running and only a running OS can respond to pings, can we actually ping the mainframe or something to check the stability of the physical connection between the x86_64 openQA worker machine and the mainframe hosting the SUT VMs?

So s390qa10[1-6] are z/VM guests, aka VMs on top of z/VM running on LPAR ZP11 (s390zp11.suse.de). It is true that a guest (VM) is only online while it is running, aka connected to an x3270 terminal in some way.
So the first check that comes to my mind is to verify that the worker can reach s390zp11.suse.de (which hopefully should always work).
Secondly, you can try to connect to it via x3270 (or s3270 on the command line, and check the return code). I would need to check the full syntax and capabilities, which I probably won't be able to do today. The basic command to connect would be simply s3270 s390zp11.suse.de, but there are a lot more options to look into.
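
To sketch the reachability side of such a check in Perl (the x3270/s3270 connection test itself is left out; the hostnames are the ones from this ticket, while the ports, timeouts and the use of Net::Ping are my assumptions, not an agreed monitoring design):

use Net::Ping;

# ICMP ping of the LPAR that hosts the z/VM guests (ICMP mode needs root privileges).
my $icmp = Net::Ping->new('icmp', 2);
print 's390zp11.suse.de ping: ', ($icmp->ping('s390zp11.suse.de') ? "ok\n" : "FAILED\n");
$icmp->close;

# TCP check of the VNC port on one of the guests. As described above, this can only
# succeed while a test keeps the guest (and its VNC server) running.
my $tcp = Net::Ping->new('tcp', 5);
$tcp->port_number(5901);
$tcp->service_check(1);    # only report "ok" if something actually listens on the port
print 's390qa101.qa.suse.de:5901 open: ', ($tcp->ping('s390qa101.qa.suse.de') ? "ok\n" : "FAILED\n");
$tcp->close;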

Actions #21

Updated by MDoucha over 2 years ago

okurz wrote:

  1. Martin Doucha will try an IPMI job with disabled installer auto update. If that fails try to reproduce s390x z/VM jobs with SLE12SP5 GM with no updates applied and auto-updating disabled

Disabling installer auto update by booting with self_update=0 did not help:
http://openqa.qam.suse.cz/tests/42595

The installer media have a last modification date of November 2019. However, some IPMI workers appear to be more prone to this issue than others. I've reproduced the VNC failure on kaylee.qam.suse.cz and jayne.qam.suse.cz so far. I haven't seen it happen on any other Prague IPMI machine (yet).

Actions #22

Updated by mkittler over 2 years ago

#111004#note-14 and further comments show that we don't have a general network problem but that it is the VNC server which is stuck. At least this applies to the scenario/problem mentioned in #99345#note-9 which I'm currently using for my investigation as it has a really high error rate.

Of course that raises the question why the VNC server is broken.

Actions #23

Updated by mkittler over 2 years ago

Maybe the VNC server isn't totally broken and vncviewer just cannot cope with it, see #111004#note-16.

Actions #24

Updated by mkittler over 2 years ago

https://github.com/os-autoinst/os-autoinst/pull/2061 improves cases mentioned in #99345#note-9.

As stated before, I'm not so sure whether these are actually network issues or VNC server issues. In today's investigation for #111004 I couldn't see any clear networking issues, e.g. I could always connect to the VNC server. However, it seems like a connection could often get stuck afterwards. That looked more like the VNC server itself got stuck (e.g. in the thread serving that particular client/connection).

I've just been restarting https://openqa.suse.de/tests/8769221 which failed with Error connecting to VNC server <s390qa103.qa.suse.de:5901>: IO::Socket::INET: connect: timeout. The restarted job runs on openqaworker2 which has my changes for #111004. Let's see whether it also lets this test pass: https://openqa.suse.de/tests/8771492#live

Actions #26

Updated by mkittler over 2 years ago

4 out of 5 restarted jobs that failed with "Connection timed out" have passed. Maybe just luck or already an improvement. At least none of the jobs got stuck and ran into a timeout.


One further workaround could be to simply increase the VNC timeout via a test setting. The default timeout is 60 seconds for remote connections and one could try to increase it, e.g. VNC_CONNECT_TIMEOUT_REMOTE=240.

Maybe it also makes sense to make multiple connection attempts until the timeout is exceeded (instead of one connection attempt with a long timeout). Implementing such a change within our VNC client should be quite simple.
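
Roughly, such a retry loop could look like the following sketch (hand-written against plain IO::Socket::INET with a hypothetical helper name, not the actual consoles::VNC code or the pending change):

use IO::Socket::INET;

# Hypothetical helper: keep making short connection attempts until an overall budget
# (e.g. the value of VNC_CONNECT_TIMEOUT_REMOTE in seconds) is used up, instead of
# making a single attempt with one long timeout.
sub connect_with_retries {
    my ($host, $port, $overall_timeout, $attempt_timeout) = @_;
    my $deadline = time + $overall_timeout;
    while (time < $deadline) {
        my $socket = IO::Socket::INET->new(
            PeerAddr => $host,
            PeerPort => $port,
            Proto    => 'tcp',
            Timeout  => $attempt_timeout,
        );
        return $socket if $socket;
        sleep 1;    # short pause before the next attempt
    }
    die "Error connecting to VNC server <$host:$port>: $@\n";
}

# e.g. connect_with_retries('s390qa101.qa.suse.de', 5901, 240, 10);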

Actions #27

Updated by mkittler over 2 years ago

  • Status changed from Workable to In Progress

PR for increasing the connect timeout: https://github.com/os-autoinst/os-autoinst/pull/2064

Actions #28

Updated by openqa_review over 2 years ago

  • Due date set to 2022-06-02

Setting due date based on mean cycle time of SUSE QE Tools

Actions #29

Updated by okurz over 2 years ago

The PR looks fine. Do you plan to do the "multiple connecting attempts" in parallel?

Actions #30

Updated by mkittler over 2 years ago

  • Related to action #111063: Ping monitoring for our s390z mainframes size:S added
Actions #31

Updated by mkittler over 2 years ago

In parallel? That was never planned. And I also think that would make things needlessly complicated (e.g. if one attempt succeeds one would have to abort the other attempts). But now it does multiple connect attempts within the extended timeout, in sequence. Let's see whether it helps to reduce the number of those incompletes.

Actions #32

Updated by okurz over 2 years ago

mkittler wrote:

In parallel? That was never planned.

No, I meant that you would work on something in parallel, not that the code would :D

Actions #33

Updated by mkittler over 2 years ago

The extended timeout has been deployed since 20.05.22 07:23 CEST and the timeout for reads/writes on the VNC socket has been deployed since 18.05.22 07:17 CEST.

I suppose the problem of s390x jobs getting stuck and running into timeouts is resolved:

openqa=> with finished as (select result, t_finished from jobs where arch='s390x') select (extract(YEAR from t_finished)) as year, (extract(MONTH from t_finished)) as month, (extract(DAY from t_finished)) as day, round(count(*) filter (where result = 'timeout_exceeded') * 100. / count(*), 2)::numeric(5,2)::float as ratio_of_all_timeouts, count(*) total from finished where t_finished >= '2022-05-15' group by year, month, day order by year, month, day asc;
 year | month | day | ratio_of_all_timeouts | total 
------+-------+-----+-----------------------+-------
 2022 |     5 |  15 |                     0 |   404
 2022 |     5 |  16 |                  2.76 |   507
 2022 |     5 |  17 |                     0 |   545
 2022 |     5 |  18 |                  0.88 |   457
 2022 |     5 |  19 |                     0 |   457
 2022 |     5 |  20 |                     0 |   187
(6 rows)

Note that the number of s390x jobs with VNC issues isn't that high overall:

openqa=> with finished as (select reason, t_finished from jobs where arch='s390x') select (extract(YEAR from t_finished)) as year, (extract(MONTH from t_finished)) as month, (extract(DAY from t_finished)) as day, round(count(*) filter (where reason like '%Error connecting to VNC%') * 100. / count(*), 2)::numeric(5,2)::float as ratio_of_all_vnc_connect_errors, count(*) total from finished where t_finished >= '2022-05-01' group by year, month, day order by year, month, day asc;
 year | month | day | ratio_of_all_vnc_connect_errors | total 
------+-------+-----+---------------------------------+-------
 2022 |     5 |   1 |                               0 |   140
 2022 |     5 |   2 |                            0.45 |   444
 2022 |     5 |   3 |                             0.3 |   657
 2022 |     5 |   4 |                            0.42 |   709
 2022 |     5 |   5 |                            1.63 |   676
 2022 |     5 |   6 |                            1.11 |  1079
 2022 |     5 |   7 |                            4.71 |   552
 2022 |     5 |   8 |                            0.25 |   405
 2022 |     5 |   9 |                            6.77 |   812
 2022 |     5 |  10 |                             5.2 |  1249
 2022 |     5 |  11 |                             1.8 |   333
 2022 |     5 |  12 |                            6.33 |   837
 2022 |     5 |  13 |                            4.03 |  1067
 2022 |     5 |  14 |                            1.07 |   466
 2022 |     5 |  15 |                            0.74 |   404
 2022 |     5 |  16 |                            1.97 |   507
 2022 |     5 |  17 |                            0.37 |   545
 2022 |     5 |  18 |                               0 |   457
 2022 |     5 |  19 |                            0.22 |   457
 2022 |     5 |  20 |                               0 |   192

There has actually been only one occurrence within the last three days:

openqa=> with finished as (select result, reason, t_finished from jobs where arch='s390x' and (result='incomplete')) select (extract(YEAR from t_finished)) as year, (extract(MONTH from t_finished)) as month, (extract(DAY from t_finished)) as day, round(count(*) filter (where reason like '%Error connecting to VNC server%') * 100. / count(*), 2)::numeric(5,2)::float as ratio_of_vnc_issues_within_incompletes, count(*) total_incompletes from finished where t_finished >= '2022-05-01' group by year, month, day order by year, month, day asc;
 year | month | day | ratio_of_vnc_issues_within_incompletes | total_incompletes 
------+-------+-----+----------------------------------------+-------------------
 2022 |     5 |   1 |                                      0 |                 1
 2022 |     5 |   2 |                                     50 |                 4
 2022 |     5 |   3 |                                     50 |                 4
 2022 |     5 |   4 |                                     60 |                 5
 2022 |     5 |   5 |                                    100 |                 9
 2022 |     5 |   6 |                                  84.62 |                13
 2022 |     5 |   7 |                                  78.13 |                32
 2022 |     5 |   8 |                                   8.33 |                12
 2022 |     5 |   9 |                                  82.81 |                64
 2022 |     5 |  10 |                                    100 |                65
 2022 |     5 |  11 |                                    100 |                 6
 2022 |     5 |  12 |                                  65.43 |                81
 2022 |     5 |  13 |                                   87.5 |                48
 2022 |     5 |  14 |                                  71.43 |                 7
 2022 |     5 |  15 |                                  10.34 |                29
 2022 |     5 |  16 |                                     80 |                10
 2022 |     5 |  17 |                                    100 |                 2
 2022 |     5 |  18 |                                      0 |                13
 2022 |     5 |  19 |                                    100 |                 1
 2022 |     5 |  20 |                                      0 |                 2

Looking at all jobs, there were only 5 jobs with VNC issues in the last three days and only one of them was an actual timeout:

openqa=> select id, t_finished, test, (select value from job_settings where job_id = jobs.id and key = 'BACKEND') as backend, arch, reason from jobs where reason like '%Error connecting to VNC server%' order by t_finished desc limit 10;
   id    |     t_finished      |                                                   test                                                    | backend |  arch  |                                                           reason                                                           
---------+---------------------+-----------------------------------------------------------------------------------------------------------+---------+--------+----------------------------------------------------------------------------------------------------------------------------
 8793234 | 2022-05-20 10:04:47 | jeos-containers:investigate:last_good_tests_and_build:1d3a5e7f5d2aa21788b5aacf7d2bde6af6fcbd78+20220518-1 | svirt   | x86_64 | backend died: Error connecting to VNC server <openqaw5-xen.qa.suse.de:5903>: IO::Socket::INET: connect: Connection refused
 8793855 | 2022-05-20 10:04:42 | jeos-containers                                                                                           | svirt   | x86_64 | backend died: Error connecting to VNC server <openqaw5-xen.qa.suse.de:5901>: IO::Socket::INET: connect: Connection refused
 8787574 | 2022-05-19 08:48:28 | offline_sles12sp3_ltss_rmt_pcm-contm_def_full                                                             | svirt   | s390x  | backend died: Error connecting to VNC server <10.161.145.86:5901>: IO::Socket::INET: connect: timeout
 8773498 | 2022-05-18 01:28:07 | jeos-containers                                                                                           | svirt   | x86_64 | backend died: Error connecting to VNC server <openqaw5-xen.qa.suse.de:5901>: IO::Socket::INET: connect: Connection refused
 8773500 | 2022-05-18 01:27:40 | jeos-filesystem                                                                                           | svirt   | x86_64 | backend died: Error connecting to VNC server <openqaw5-xen.qa.suse.de:5903>: IO::Socket::INET: connect: Connection refused
 8774388 | 2022-05-17 23:10:07 | qam-minimal+base                                                                                          | svirt   | s390x  | backend died: Error connecting to VNC server <10.161.145.92:5901>: IO::Socket::INET: connect: Connection timed out
 8771988 | 2022-05-17 15:08:36 | installation_sles11sp4_sles15SP3_sles15sp4_zVM_ph0                                                        | s390x   | s390x  | backend died: Error connecting to VNC server <s390qa102.qa.suse.de:5901>: IO::Socket::INET: connect: timeout
 8768593 | 2022-05-16 23:11:32 | qam-minimal+base                                                                                          | svirt   | s390x  | backend done: Error connecting to VNC server <10.161.145.96:5901>: IO::Socket::INET: connect: Connection timed out
 8768591 | 2022-05-16 23:10:48 | mru-install-minimal-with-addons                                                                           | svirt   | s390x  | backend died: Error connecting to VNC server <10.161.145.93:5901>: IO::Socket::INET: connect: Connection timed out
 8769221 | 2022-05-16 21:08:11 | qam-minimal-full                                                                                          | s390x   | s390x  | backend done: Error connecting to VNC server <s390qa103.qa.suse.de:5901>: IO::Socket::INET: connect: timeout
(10 rows)

In https://openqa.suse.de/tests/8793234/logfile?filename=autoinst-log.txt we can see that the increased timeout of 4 minutes is effective. I doubt that extending the timeout further would be helpful (the error here is "Connection refused" so the VNC server was likely not running anymore). I suppose this is actually a different issue (and we have #76813 for that).

https://openqa.suse.de/tests/8787574/logfile?filename=autoinst-log.txt ran before increasing the VNC timeout and it might have helped:

[2022-05-19T10:48:11.285168+02:00] [info] ::: backend::baseclass::die_handler: Backend process died, backend errors are reported below in the following lines:
  Error connecting to VNC server <10.161.145.86:5901>: IO::Socket::INET: connect: timeout
[2022-05-19T10:48:11.285945+02:00] [debug] Closing SSH serial connection with s390zp18.suse.de

I'll check again next week.

Actions #34

Updated by okurz over 2 years ago

Please keep in mind that SLE maintenance tests run far fewer s390x jobs and SLE validation for new builds has slowed down significantly after the SLE15SP4 GMC release, so you are likely better off with explicitly triggered tests to get good statistics.

Actions #35

Updated by mkittler over 2 years ago

Ok, I've been cloning the last 100 passing s390x jobs on OSD: https://openqa.suse.de/tests/overview?build=test-s390x

Let's see how high the error rate will be.

Actions #36

Updated by mkittler over 2 years ago

Most of the jobs I've spawned were obsoleted:

openqa=> with test_jobs as (select distinct id, state, result from jobs where build = 'test-s390x') select state, result, count(id) * 100. / (select count(id) from test_jobs) as ratio from test_jobs group by test_jobs.state, test_jobs.result order by ratio desc;
   state   |     result      |         ratio          
-----------+-----------------+------------------------
 cancelled | obsoleted       |    76.4705882352941176
 done      | passed          |    16.6666666666666667
 done      | failed          |     3.9215686274509804
 done      | parallel_failed |     1.9607843137254902
 done      | softfailed      | 0.98039215686274509804
(5 rows)

However, the jobs that ran mostly passed and the few failures were not due to VNC connection issues. In fact, none of the > 1200 s390x jobs ran into the VNC error:

openqa=> with finished as (select reason, t_finished from jobs where arch='s390x') select (extract(YEAR from t_finished)) as year, (extract(MONTH from t_finished)) as month, (extract(DAY from t_finished)) as day, round(count(*) filter (where reason like '%Error connecting to VNC%') * 100. / count(*), 2)::numeric(5,2)::float as ratio_of_all_vnc_connect_errors, count(*) total from finished where t_finished >= '2022-05-19' group by year, month, day order by year, month, day asc;
 year | month | day | ratio_of_all_vnc_connect_errors | total 
------+-------+-----+---------------------------------+-------
 2022 |     5 |  19 |                            0.22 |   457
 2022 |     5 |  20 |                               0 |   570
 2022 |     5 |  21 |                               0 |   508
 2022 |     5 |  22 |                               0 |   102
 2022 |     5 |  23 |                               0 |    98
(5 rows)

So I would conclude that the issue isn't very apparent anymore.

Actions #37

Updated by mkittler over 2 years ago

  • Status changed from In Progress to Feedback
  • Priority changed from Urgent to High
Actions #38

Updated by okurz over 2 years ago

mkittler wrote:

So I would conclude that the issue isn't very apparent anymore.

We wouldn't gain a lot if the issue just became less probable, because that would also make reproducing and fixing it harder. So we should be sure that the same issue realistically cannot happen anymore. Of course connections could still time out, but if that happens then it should be really clear to the test reviewers what to do. And they should not feel like they need to reopen this ticket or (even worse) just manually restart because they assume it's basically still the same issue.

Actions #39

Updated by mkittler over 2 years ago

  • Due date changed from 2022-06-02 to 2022-06-17

I'll be on vacation as of tomorrow. I'll check statistics again when I'm back.

Actions #40

Updated by mkittler over 2 years ago

So far it already looks good. Not a single further job with "Error connecting to VNC".

Actions #41

Updated by mkittler over 2 years ago

We had some more VNC issues but the ratio of them is quite low:

openqa=> with finished as (select reason, t_finished from jobs where arch='s390x') select (extract(YEAR from t_finished)) as year, (extract(MONTH from t_finished)) as month, (extract(DAY from t_finished)) as day, round(count(*) filter (where reason like '%Error connecting to VNC%') * 100. / count(*), 2)::numeric(5,2)::float as ratio_of_all_vnc_connect_errors, count(*) total from finished where t_finished >= '2022-05-19' group by year, month, day order by year, month, day asc;
 year | month | day | ratio_of_all_vnc_connect_errors | total 
------+-------+-----+---------------------------------+-------
 2022 |     5 |  19 |                            0.22 |   446
 2022 |     5 |  20 |                               0 |   564
 2022 |     5 |  21 |                               0 |   499
 2022 |     5 |  22 |                               0 |    96
 2022 |     5 |  23 |                               0 |   372
 2022 |     5 |  24 |                               0 |   229
 2022 |     5 |  25 |                               0 |   383
 2022 |     5 |  26 |                               0 |   340
 2022 |     5 |  27 |                               0 |   339
 2022 |     5 |  28 |                               0 |   180
 2022 |     5 |  29 |                               0 |   283
 2022 |     5 |  30 |                               0 |   483
 2022 |     5 |  31 |                               0 |   437
 2022 |     6 |   1 |                               0 |   517
 2022 |     6 |   2 |                               0 |   370
 2022 |     6 |   3 |                            0.13 |   747
 2022 |     6 |   4 |                               0 |   356
 2022 |     6 |   5 |                            0.24 |   410
 2022 |     6 |   6 |                               0 |   261
 2022 |     6 |   7 |                               0 |   364
 2022 |     6 |   8 |                               0 |   447
 2022 |     6 |   9 |                               0 |   467
(22 rows)

Those are the relevant jobs:

openqa=> select id, t_finished, test, (select value from job_settings where job_id = jobs.id and key = 'BACKEND') as backend, arch, reason from jobs where reason like '%Error connecting to VNC server%' order by t_finished desc limit 10;
   id    |     t_finished      |                                                   test                                                    | backend |  arch  |                                                           reason                                                           
---------+---------------------+-----------------------------------------------------------------------------------------------------------+---------+--------+----------------------------------------------------------------------------------------------------------------------------
 8888867 | 2022-06-05 10:49:20 | sle-micro_autoyast                                                                                        | svirt   | s390x  | backend died: Error connecting to VNC server <10.161.145.83:5901>: IO::Socket::INET: connect: timeout
 8871077 | 2022-06-03 09:37:44 | qam-minimal-full                                                                                          | s390x   | s390x  | backend died: Error connecting to VNC server <s390qa105.qa.suse.de:5901>: IO::Socket::INET: connect: timeout
 8870924 | 2022-06-03 08:42:54 | jeos-containers                                                                                           | svirt   | x86_64 | backend died: Error connecting to VNC server <openqaw5-xen.qa.suse.de:5901>: IO::Socket::INET: connect: Connection refused
 8870930 | 2022-06-03 08:42:53 | jeos-fips                                                                                                 | svirt   | x86_64 | backend died: Error connecting to VNC server <openqaw5-xen.qa.suse.de:5902>: IO::Socket::INET: connect: Connection refused
 8830494 | 2022-05-27 05:59:24 | jeos-fs_stress                                                                                            | svirt   | x86_64 | backend died: Error connecting to VNC server <openqaw5-xen.qa.suse.de:5902>: IO::Socket::INET: connect: Connection refused
 8830463 | 2022-05-27 05:59:19 | jeos-containers                                                                                           | svirt   | x86_64 | backend died: Error connecting to VNC server <openqaw5-xen.qa.suse.de:5903>: IO::Socket::INET: connect: Connection refused
 8793234 | 2022-05-20 10:04:47 | jeos-containers:investigate:last_good_tests_and_build:1d3a5e7f5d2aa21788b5aacf7d2bde6af6fcbd78+20220518-1 | svirt   | x86_64 | backend died: Error connecting to VNC server <openqaw5-xen.qa.suse.de:5903>: IO::Socket::INET: connect: Connection refused
 8793855 | 2022-05-20 10:04:42 | jeos-containers                                                                                           | svirt   | x86_64 | backend died: Error connecting to VNC server <openqaw5-xen.qa.suse.de:5901>: IO::Socket::INET: connect: Connection refused
 8787574 | 2022-05-19 08:48:28 | offline_sles12sp3_ltss_rmt_pcm-contm_def_full                                                             | svirt   | s390x  | backend died: Error connecting to VNC server <10.161.145.86:5901>: IO::Socket::INET: connect: timeout
 8773498 | 2022-05-18 01:28:07 | jeos-containers                                                                                           | svirt   | x86_64 | backend died: Error connecting to VNC server <openqaw5-xen.qa.suse.de:5901>: IO::Socket::INET: connect: Connection refused
(10 rows)

Job 8888867 actually shows a VNC-related error during the installation (see https://openqa.suse.de/tests/8888867#step/installation/2) and that error is not present in the previous run (see https://openqa.suse.de/tests/8909978#step/installation/3) so it might actually be a product bug.

The logs of job 8871077 show that the timeout and re-connect attempts work as expected but were still not enough (so presumably the VNC server got stuck/stopped/crashed):

[2022-06-03T11:33:36.162953+02:00] [warn] !!! consoles::VNC::update_framebuffer: Error in VNC protocol - relogin: short read for zrle data 991 - 2244
[2022-06-03T11:34:38.221298+02:00] [warn] !!! consoles::VNC::login: Error connecting to VNC server <s390qa105.qa.suse.de:5901>: IO::Socket::INET: connect: timeout
[2022-06-03T11:35:39.273462+02:00] [warn] !!! consoles::VNC::login: Error connecting to VNC server <s390qa105.qa.suse.de:5901>: IO::Socket::INET: connect: timeout
[2022-06-03T11:36:40.319105+02:00] [warn] !!! consoles::VNC::login: Error connecting to VNC server <s390qa105.qa.suse.de:5901>: IO::Socket::INET: connect: timeout
[2022-06-03T11:37:41.382654+02:00] [info] ::: backend::baseclass::die_handler: Backend process died, backend errors are reported below in the following lines:
  Error connecting to VNC server <s390qa105.qa.suse.de:5901>: IO::Socket::INET: connect: timeout

The same goes for 8870924 and the further jobs. Not sure whether we can do much about it and why the VNC server on the other end doesn't work anymore. I'm also wondering what the difference between the timeouts and the refused connections is. Maybe in the case of the timeouts the VNC server on the other end is technically still listening but stuck? It looks like s390x jobs normally run into the timeout and x86_64 jobs into connection refused. So those are likely distinct problems.

Actions #42

Updated by mkittler over 2 years ago

Only 3 new jobs have failed in the last 4 days (https://openqa.suse.de/tests/8939209, https://openqa.suse.de/tests/8928450, https://openqa.suse.de/tests/8928436). That's still a small percentage. I'm wondering whether that's already acceptable. If not, I suppose I'd try to debug the VNC servers themselves because the retry on the client side should already cover client or connection problems. I just have no idea how to debug the VNC servers. (The svirt/s390x and svirt/x86_64 cases seem to be different as there's also a slightly different error message and the VNC server is maybe provided differently.)

Actions #43

Updated by okurz over 2 years ago

mkittler wrote:

Only 3 new jobs have failed for the last 4 days (https://openqa.suse.de/tests/8939209, https://openqa.suse.de/tests/8928450, https://openqa.suse.de/tests/8928436). That's still a small percentage. I'm wondering whether that's already acceptable.

Just put yourself into the shoes of the (poor) openQA test maintainers or maybe a package maintainer that learns that their package update is not accepted due to any of the above test failures. What would you do? Who should handle such failures?

Actions #44

Updated by mkittler over 2 years ago

The only thing I can improve right now is having those jobs restarted automatically: https://github.com/os-autoinst/openQA/pull/4702

I actually thought it would already be the case, but the current regex isn't matching the remaining cases anymore.

Actions #46

Updated by okurz over 2 years ago

  • Due date deleted (2022-06-17)
  • Status changed from Feedback to Resolved

With that merged and deployed and openqa-query-for-job-label poo#99345 only showing

8871077|2022-06-03 09:37:44|done|incomplete|qam-minimal-full|backend died: Error connecting to VNC server <s390qa105.qa.suse.de:5901>: IO::Socket::INET: connect: timeout|openqaworker2
8769221|2022-05-16 21:08:11|done|failed|qam-minimal-full|backend done: Error connecting to VNC server <s390qa103.qa.suse.de:5901>: IO::Socket::INET: connect: timeout|openqaworker2

we assume that this is resolved by now even though we did not dig deeper into VNC server debugging. Anyway, for new specific problem observations please open a separate, dedicated ticket.
