action #99345

closed

action #107062: Multiple failures due to network issues

[tools][qem] Incomplete test runs on s390x with auto_review:"backend died: Error connecting to VNC server.*s390.*Connection timed out":retry size:M

Added by vsvecova about 3 years ago. Updated over 2 years ago.

Status: Resolved
Priority: High
Assignee:
Category: Bugs in existing tests
Start date: 2021-09-27
Due date:
% Done: 0%
Estimated time:
Difficulty:
Description

Observation

The openQA test in scenario sle-12-SP4-Server-DVD-Updates-s390x-mru-install-minimal-with-addons@s390x-kvm-sle12 is incomplete; it stops at start_install.

Test suite description

Testsuite maintained at https://gitlab.suse.de/qa-maintenance/qam-openqa-yml.

Reproducible

Fails since (at least) Build 20210927-1 (current job)

Find jobs referencing this ticket with the help of
https://raw.githubusercontent.com/os-autoinst/scripts/master/openqa-query-for-job-label
by calling: openqa-query-for-job-label poo#99345

Expected result

Last good: 20210925-1 (or more recent)

Acceptance criteria

  • AC1: The root cause of the problem is known
  • AC2: The next steps are known and have been initiated

Suggestions

  • Talk to all the people involved to get the full story

Further details

Always latest result in this scenario: latest

IPMI and s390x workers keep losing the VNC connection during SLES installation, and the reconnect attempt then gets stuck for an unknown reason until the job hits MAX_JOB_TIME:

[2022-05-11T13:46:24.602260+02:00] [debug] <<< testapi::wait_screen_change(timeout=10, similarity_level=50)
XIO:  fatal IO error 4 (Interrupted system call) on X server ":37191"
      after 23426 requests (23426 known processed) with 0 events remaining.
XIO:  fatal IO error 11 (Resource temporarily unavailable) on X server ":34867"
      after 28852 requests (28852 known processed) with 0 events remaining.
[2022-05-11T15:36:53.708309+02:00] [debug] autotest received signal TERM, saving results of current test before exiting
[2022-05-11T15:36:53.708518+02:00] [debug] isotovideo received signal TERM
[2022-05-11T15:36:53.708516+02:00] [debug] backend got TERM

Note that the job spent 110 minutes in wait_screen_change() that was supposed to time out after 10 seconds.
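The 110-minute figure can be checked against the two timestamps in the log excerpt above. A minimal Python sketch, assuming the log timestamps are ISO 8601 as printed; the variable names are illustrative only and not part of openQA:

```python
from datetime import datetime

# Timestamps copied from the log excerpt above.
call_started = datetime.fromisoformat("2022-05-11T13:46:24.602260+02:00")
term_received = datetime.fromisoformat("2022-05-11T15:36:53.708309+02:00")

# Timeout that was passed to wait_screen_change(), in seconds.
requested_timeout_s = 10

elapsed_s = (term_received - call_started).total_seconds()
elapsed_min = elapsed_s / 60

print(f"elapsed: {elapsed_min:.1f} min")
print(f"ratio to requested timeout: {elapsed_s / requested_timeout_s:.0f}x")
```

The gap between the call and the TERM signal is a bit over 110 minutes, so the call ran several hundred times longer than its requested timeout.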

In another job it was stuck on assert_screen:

[2022-05-11T15:58:59.169408+02:00] [debug] <<< testapi::assert_screen(mustmatch="installation", no_wait=1, timeout=30)
[2022-05-11T16:15:16.486959+02:00] [warn] !!! consoles::VNC::catch {...} : Error in VNC protocol - relogin: short read for zrle data 659 - 950
[2022-05-11T21:47:05.966962+02:00] [debug] backend got TERM
[2022-05-11T21:47:05.966980+02:00] [debug] isotovideo received signal TERM
[2022-05-11T21:47:05.967084+02:00] [debug] autotest received signal TERM, saving results of current test before exiting
XIO:  fatal IO error 11 (Resource temporarily unavailable) on X server ":44341"
      after 28689 requests (28689 known processed) with 0 events remaining.
XIO:  fatal IO error 11 (Resource temporarily unavailable) on X server ":60785"
      after 39307 requests (39307 known processed) with 0 events remaining.

Here we even got a VNC error, and the VNC client would try to log in again. I suppose that is pointless, though, because the VNC server terminates anyway when the connection is lost. So for a real retry we would likely also need to restart the VNC server. (Note that @MDoucha tried to connect manually here, see https://progress.opensuse.org/issues/99345#note-9)
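Both excerpts show a read on the VNC connection that never returns, so the test API timeout is never reached. As a generic illustration only (this is not how os-autoinst implements its VNC console), a blocking socket read can be bounded with a socket-level timeout. The helper name read_with_timeout and the local socket pair standing in for a VNC connection are assumptions made for this sketch:

```python
import socket


def read_with_timeout(sock, nbytes, timeout_s):
    """Return up to nbytes from sock, or None if nothing arrives within timeout_s.

    Hypothetical helper for illustration; not an os-autoinst function.
    """
    sock.settimeout(timeout_s)
    try:
        return sock.recv(nbytes)
    except socket.timeout:
        return None


# Stand-in for a VNC connection: a local socket pair whose peer never sends data.
client, silent_peer = socket.socketpair()

result = read_with_timeout(client, 1024, 0.2)
print("read timed out" if result is None else f"got {len(result)} bytes")

client.close()
silent_peer.close()
```

With such a bound in place, the caller regains control after the timeout and can decide whether to retry, which, as noted above, would likely also require restarting the VNC server.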


Related issues 5 (1 open, 4 closed)

  • Related to openQA Project (public) - action #76813: [tools] Test using svirt backend fails with auto_review:"Error connecting to VNC server.*: IO::Socket::INET: connect: Connection refused" (New, 2020-10-30)
  • Related to openQA Project (public) - action #111004: Timeout of test API functions not enforced if backend gets stuck, e.g. on the VNC socket size:M (Resolved, mkittler, 2022-05-12, 2022-05-28)
  • Related to openQA Project (public) - coordination #109656: [epic] Stable non-qemu backends (Resolved, okurz, 2021-12-29)
  • Related to openQA Infrastructure (public) - action #111063: Ping monitoring for our s390z mainframes size:S (Resolved, okurz, 2022-05-13)
  • Has duplicate openQA Tests (public) - action #110902: [qe-core]qam-minimal-full@s390x-zVM-vswitch-l3 multiple failures (Rejected, mgrifalconi)
