action #37785

[functional][s390x][u] test fails in start_install - maybe disable stall detection?

Added by nicksinger almost 2 years ago. Updated about 1 month ago.

Status: Rejected
Start date: 25/06/2018
Priority: Normal
Due date:
Assignee: -
% Done: 0%
Category: Bugs in existing tests
Target version: QA - future
Difficulty:
Duration:

Description

Observation

openQA test in scenario sle-12-SP4-Server-DVD-s390x-xfs@s390x-kvm-sle12 fails in
start_install.

Hypothesis

From the logs it looks like the stall detection kicks in before the needle being waited for has a chance to match.
The "stall" seems to be declared when the screen (here, the progress bar) does not update for more than 4.26s, even though the needle-match timeout still had 1675.0s left. This hypothesis is based on what I could see in the logs:

[2018-06-21T20:03:02.0931 CEST] [debug] MATCH(rebootnow-20131217:0.00)
[2018-06-21T20:03:03.0004 CEST] [debug] MATCH(rebootnow-20150409:0.00)
[2018-06-21T20:03:03.0077 CEST] [debug] MATCH(rebootnow-20160504:0.64)
[2018-06-21T20:03:03.0151 CEST] [debug] MATCH(rebootnow-390x-20150709:0.26)
[2018-06-21T20:03:03.0216 CEST] [debug] MATCH(rebootnow-390x-20160506:0.00)
[2018-06-21T20:03:03.0348 CEST] [debug] MATCH(install_and_reboot-additional-packages-20170823:0.09)
[2018-06-21T20:03:03.0352 CEST] [debug] no match: 1675.0s
[2018-06-21T20:03:03.0352 CEST] [debug] considering VNC stalled, no update for 4.26 seconds
[2018-06-21T20:03:05.0969 CEST] [debug] GET "/7taHN1Dnqxy0gF22/isotovideo/status"
[2018-06-21T20:03:05.0970 CEST] [debug] Routing to a callback
DIE Error connecting to host <10.161.145.14>: IO::Socket::INET: connect: Connection timed out
 at /usr/lib/os-autoinst/backend/baseclass.pm line 80.
    backend::baseclass::die_handler('OpenQA::Exception::VNCSetupError=HASH(0x5ecd188)') called at /usr/lib/perl5/vendor_perl/5.18.2/Exception/Class/Base.pm line 85
    Exception::Class::Base::throw('OpenQA::Exception::VNCSetupError', 'error', 'Error connecting to host <10.161.145.14>: IO::Socket::INET: c...') called at /usr/lib/os-autoinst/consoles/VNC.pm line 151
    consoles::VNC::login('consoles::VNC=HASH(0x5ed0520)') called at /usr/lib/os-autoinst/consoles/VNC.pm line 842
    consoles::VNC::send_update_request('consoles::VNC=HASH(0x5ed0520)') called at /usr/lib/os-autoinst/consoles/vnc_base.pm line 82
    consoles::vnc_base::request_screen_update('consoles::vnc_base=HASH(0x467e358)', undef) called at /usr/lib/os-autoinst/backend/baseclass.pm line 587
    backend::baseclass::bouncer('backend::svirt=HASH(0x6d7cc18)', 'request_screen_update', undef) called at /usr/lib/os-autoinst/backend/baseclass.pm line 570
    backend::baseclass::request_screen_update('backend::svirt=HASH(0x6d7cc18)') called at /usr/lib/os-autoinst/backend/baseclass.pm line 177
    eval {...} called at /usr/lib/os-autoinst/backend/baseclass.pm line 156
    backend::baseclass::run_capture_loop('backend::svirt=HASH(0x6d7cc18)') called at /usr/lib/os-autoinst/backend/baseclass.pm line 129
    backend::baseclass::run('backend::svirt=HASH(0x6d7cc18)', 5, 8) called at /usr/lib/os-autoinst/backend/driver.pm line 85
    backend::driver::start('backend::driver=HASH(0x5cc89d8)') called at /usr/lib/os-autoinst/backend/driver.pm line 48
    backend::driver::new('backend::driver', 'svirt') called at /usr/bin/isotovideo line 236
    main::init_backend() called at /usr/bin/isotovideo line 305
[2018-06-21T20:05:10.0632 CEST] [debug] Destroying openQA-SUT-2 virtual machine
[2018-06-21T20:05:10.0703 CEST] [debug] Connection to root@s390p8.suse.de established
[2018-06-21T20:05:11.0259 CEST] [debug] Command's stdout:
Domain openQA-SUT-2 destroyed

But take it with a grain of salt:

16:39 <nsinger> foursixnine: https://openqa.suse.de/tests/1777076/file/autoinst-log.txt does the "connection timed out" means that the stall-detection kicked in?
16:39 <nsinger> or is it not directly correlated?
16:43 <foursixnine> nsinger: I wouldn't put my life on it, but looks like
16:43 <foursixnine> that part of the code tries to reconnect
16:44 <nsinger> I'm just curious if we may need to disable the stall-detection here since the needle-match timeout still has 1675.0s left at that time

So disabling the stall detection might already be enough to work around this issue.
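To check whether other jobs show the same pattern, the log can be scanned for stall messages together with the needle timeout that was still remaining at that point. The following is only a rough sketch: the two regular expressions are derived from the log excerpt above, and `find_stalls` is a hypothetical helper name, not part of os-autoinst. Other os-autoinst versions may word these messages differently.

```python
import re

# Patterns derived from the log excerpt in this ticket (assumption: the
# wording is the same in the logs you want to scan).
NO_MATCH = re.compile(r"\[debug\] no match: (?P<left>[\d.]+)s")
STALLED = re.compile(
    r"\[debug\] considering VNC stalled, no update for (?P<idle>[\d.]+) seconds"
)


def find_stalls(lines):
    """Return (remaining_needle_timeout, idle_seconds) for each stall message.

    remaining_needle_timeout is taken from the most recent "no match" line
    seen before the stall message, or None if there was none.
    """
    remaining = None
    stalls = []
    for line in lines:
        m = NO_MATCH.search(line)
        if m:
            remaining = float(m.group("left"))
            continue
        m = STALLED.search(line)
        if m:
            stalls.append((remaining, float(m.group("idle"))))
    return stalls


# Three lines copied from the excerpt above.
sample = [
    "[2018-06-21T20:03:03.0348 CEST] [debug] MATCH(install_and_reboot-additional-packages-20170823:0.09)",
    "[2018-06-21T20:03:03.0352 CEST] [debug] no match: 1675.0s",
    "[2018-06-21T20:03:03.0352 CEST] [debug] considering VNC stalled, no update for 4.26 seconds",
]
print(find_stalls(sample))  # [(1675.0, 4.26)]
```

For a full job, the same function can be fed the lines of a downloaded autoinst-log.txt. A stall reported while a large needle timeout is still left would be consistent with the hypothesis, but it does not prove that the stall detection caused the connection error.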

Reproducible

Fails since (at least) Build 0263 (current job)

Expected result

Last good: 0262 (or more recent)

Further details

Always latest result in this scenario: latest


Related issues

Related to openQA Tests - action #52763: [functiona][y] test incompletes in start_install after 3h (Rejected, 08/06/2019)

History

#1 Updated by okurz almost 2 years ago

  • Subject changed from [functional][s390x][u][fast] test fails in start_install - maybe disable stall detection? to [functional][s390x][u] test fails in start_install - maybe disable stall detection?
  • Target version set to future

Hm, why fast? I don't see it this way.

#2 Updated by okurz 10 months ago

  • Related to action #52763: [functiona][y] test incompletes in start_install after 3h added

#3 Updated by mgriessmeier about 1 month ago

  • Status changed from New to Rejected

No latest result is present anymore; the issue is addressed in many other tickets.
