action #37785

closed

[functional][s390x][u] test fails in start_install - maybe disable stall detection?

Added by nicksinger over 6 years ago. Updated almost 5 years ago.

Status: Rejected
Priority: Normal
Assignee: -
Category: Bugs in existing tests
Target version:
Start date: 2018-06-25
Due date:
% Done: 0%
Estimated time:
Difficulty:

Description

Observation

openQA test in scenario sle-12-SP4-Server-DVD-s390x-xfs@s390x-kvm-sle12 fails in
start_install.

Hypothesis

From the logs, it looks like the stall detection kicks in before the awaited needle has a chance to match.
The "stall" seems to be triggered when the progress bar has not updated for more than 4.26s. This hypothesis is based on what I could see in the logs:

[2018-06-21T20:03:02.0931 CEST] [debug] MATCH(rebootnow-20131217:0.00)
[2018-06-21T20:03:03.0004 CEST] [debug] MATCH(rebootnow-20150409:0.00)
[2018-06-21T20:03:03.0077 CEST] [debug] MATCH(rebootnow-20160504:0.64)
[2018-06-21T20:03:03.0151 CEST] [debug] MATCH(rebootnow-390x-20150709:0.26)
[2018-06-21T20:03:03.0216 CEST] [debug] MATCH(rebootnow-390x-20160506:0.00)
[2018-06-21T20:03:03.0348 CEST] [debug] MATCH(install_and_reboot-additional-packages-20170823:0.09)
[2018-06-21T20:03:03.0352 CEST] [debug] no match: 1675.0s
[2018-06-21T20:03:03.0352 CEST] [debug] considering VNC stalled, no update for 4.26 seconds
[2018-06-21T20:03:05.0969 CEST] [debug] GET "/7taHN1Dnqxy0gF22/isotovideo/status"
[2018-06-21T20:03:05.0970 CEST] [debug] Routing to a callback
DIE Error connecting to host <10.161.145.14>: IO::Socket::INET: connect: Connection timed out
 at /usr/lib/os-autoinst/backend/baseclass.pm line 80.
    backend::baseclass::die_handler('OpenQA::Exception::VNCSetupError=HASH(0x5ecd188)') called at /usr/lib/perl5/vendor_perl/5.18.2/Exception/Class/Base.pm line 85
    Exception::Class::Base::throw('OpenQA::Exception::VNCSetupError', 'error', 'Error connecting to host <10.161.145.14>: IO::Socket::INET: c...') called at /usr/lib/os-autoinst/consoles/VNC.pm line 151
    consoles::VNC::login('consoles::VNC=HASH(0x5ed0520)') called at /usr/lib/os-autoinst/consoles/VNC.pm line 842
    consoles::VNC::send_update_request('consoles::VNC=HASH(0x5ed0520)') called at /usr/lib/os-autoinst/consoles/vnc_base.pm line 82
    consoles::vnc_base::request_screen_update('consoles::vnc_base=HASH(0x467e358)', undef) called at /usr/lib/os-autoinst/backend/baseclass.pm line 587
    backend::baseclass::bouncer('backend::svirt=HASH(0x6d7cc18)', 'request_screen_update', undef) called at /usr/lib/os-autoinst/backend/baseclass.pm line 570
    backend::baseclass::request_screen_update('backend::svirt=HASH(0x6d7cc18)') called at /usr/lib/os-autoinst/backend/baseclass.pm line 177
    eval {...} called at /usr/lib/os-autoinst/backend/baseclass.pm line 156
    backend::baseclass::run_capture_loop('backend::svirt=HASH(0x6d7cc18)') called at /usr/lib/os-autoinst/backend/baseclass.pm line 129
    backend::baseclass::run('backend::svirt=HASH(0x6d7cc18)', 5, 8) called at /usr/lib/os-autoinst/backend/driver.pm line 85
    backend::driver::start('backend::driver=HASH(0x5cc89d8)') called at /usr/lib/os-autoinst/backend/driver.pm line 48
    backend::driver::new('backend::driver', 'svirt') called at /usr/bin/isotovideo line 236
    main::init_backend() called at /usr/bin/isotovideo line 305
[2018-06-21T20:05:10.0632 CEST] [debug] Destroying openQA-SUT-2 virtual machine
[2018-06-21T20:05:10.0703 CEST] [debug] Connection to root@s390p8.suse.de established
[2018-06-21T20:05:11.0259 CEST] [debug] Command's stdout:
Domain openQA-SUT-2 destroyed

But take it with a grain of salt:

16:39 <nsinger> foursixnine: https://openqa.suse.de/tests/1777076/file/autoinst-log.txt does the "connection timed out" means that the stall-detection kicked in?
16:39 <nsinger> or is it not directly correlated?
16:43 <foursixnine> nsinger: I wouldn't put my life on it, but looks like
16:43 <foursixnine> that part of the code tries to reconnect
16:44 <nsinger> I'm just curious if we may need to disable the stall-detection here since the needle-match timeout still has 1675.0s left at that time

So maybe disabling the stall detection already helps to circumvent this issue.
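
To check whether the stall message regularly shows up while a needle wait still has plenty of time left, the autoinst-log.txt of several jobs could be scanned with something like the sketch below. It is only an illustration: the two patterns are copied from the log lines quoted above, the interpretation of "no match: 1675.0s" as remaining needle timeout follows the chat excerpt, and the function name find_stalls is made up for this example.

```python
import re

# Patterns taken from the log lines quoted in this ticket.
NO_MATCH = re.compile(r"\[debug\] no match: (?P<left>\d+(?:\.\d+)?)s")
STALLED = re.compile(
    r"\[debug\] considering VNC stalled, no update for (?P<idle>\d+(?:\.\d+)?) seconds"
)


def find_stalls(lines):
    """Return (idle_seconds, needle_seconds_left) for each stall message.

    needle_seconds_left comes from the most recent "no match" line seen
    before the stall message, or None if no such line preceded it.
    """
    results = []
    last_left = None
    for line in lines:
        m = NO_MATCH.search(line)
        if m:
            last_left = float(m.group("left"))
            continue
        m = STALLED.search(line)
        if m:
            results.append((float(m.group("idle")), last_left))
    return results


# Two lines copied from the log excerpt in this ticket.
SAMPLE = """\
[2018-06-21T20:03:03.0352 CEST] [debug] no match: 1675.0s
[2018-06-21T20:03:03.0352 CEST] [debug] considering VNC stalled, no update for 4.26 seconds
""".splitlines()

if __name__ == "__main__":
    for idle, left in find_stalls(SAMPLE):
        print(f"stalled after {idle}s idle, needle timeout left: {left}s")
```

If most failing jobs show a stall message followed by the VNC connection error while the needle timeout is far from expired, that would support the hypothesis. It would not prove that the stall detection itself causes the failure.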

Reproducible

Fails since (at least) Build 0263 (current job)

Expected result

Last good: 0262 (or more recent)

Further details

Always latest result in this scenario: latest


Related issues 1 (0 open, 1 closed)

Related to openQA Tests (public) - action #52763: [functiona][y] test incompletes in start_install after 3h (Rejected, riafarov, 2019-06-08)
