action #20022

[sle][functional][zkvm][s390] incomplete test due to socket does not exist. Probably your backend instance could not start or died

Added by riafarov over 2 years ago. Updated over 2 years ago.

Status: Resolved
Start date: 23/06/2017
Priority: Urgent
Due date: 25/10/2017
Assignee: mgriessmeier
% Done: 0%
Category: Bugs in existing tests
Target version: openQA Project - Milestone 11
Difficulty:
Duration: 89

Description

Observation

Please see the log extracts. This seems to be a sporadic issue, so repeated runs may be needed to reproduce it.

openQA test in scenario sle-12-SP3-Server-DVD-s390x-lvm-full-encrypt@zkvm fails in
nautilus

Reproducible

Fails since (at least) Build 0439 (current job)

Expected result

Last good: 0437 (or more recent)

Further details

Always latest result in this scenario: latest

Please enter passphrase for disk cr_vda2!
09:46:54.0059 26913 >>> testapi::wait_serial: (?:Welcome to SUSE Linux Enterprise Server.*(s390x)): fail
09:46:54.0060 Debug: /var/lib/openqa/cache/openqa.suse.de/tests/sle/tests/x11/reboot_gnome.pm:23 called opensusebasetest::wait_boot
09:46:54.0061 26913 <<< testapi::select_console(testapi_console='x11')
/usr/lib/os-autoinst/consoles/vnc_base.pm:64:{
'port' => 5901,
'hostname' => '10.161.145.4',
'password' => 'nots3cr3t'
}
DIE socket does not exist. Probably your backend instance could not start or died. at /usr/lib/os-autoinst/consoles/VNC.pm line 881.

at /usr/lib/os-autoinst/backend/baseclass.pm line 80.
backend::baseclass::die_handler('socket does not exist. Probably your backend instance could n...') called at /usr/lib/os-autoinst/consoles/VNC.pm line 801
consoles::VNC::catch {...} ('socket does not exist. Probably your backend instance could n...') called at /usr/lib/perl5/vendor_perl/5.18.2/Try/Tiny.pm line 115
Try::Tiny::try('CODE(0x82dad60)', 'Try::Tiny::Catch=REF(0x766fb60)') called at /usr/lib/os-autoinst/consoles/VNC.pm line 803
consoles::VNC::update_framebuffer('consoles::VNC=HASH(0x766ff68)') called at /usr/lib/os-autoinst/consoles/vnc_base.pm line 74
consoles::vnc_base::request_screen_update('consoles::vnc_base=HASH(0x4b30870)') called at /usr/lib/os-autoinst/consoles/vnc_base.pm line 86
consoles::vnc_base::current_screen('consoles::vnc_base=HASH(0x4b30870)') called at /usr/lib/os-autoinst/backend/baseclass.pm line 587
backend::baseclass::capture_screenshot('backend::svirt=HASH(0x67fa348)') called at /usr/lib/os-autoinst/backend/baseclass.pm line 482
backend::baseclass::select_console('backend::svirt=HASH(0x67fa348)', 'HASH(0x766f7f0)') called at /usr/lib/os-autoinst/backend/baseclass.pm line 75
backend::baseclass::handle_command('backend::svirt=HASH(0x67fa348)', 'HASH(0x7667028)') called at /usr/lib/os-autoinst/backend/baseclass.pm line 436
backend::baseclass::check_socket('backend::svirt=HASH(0x67fa348)', 'IO::Handle=GLOB(0x6580458)', 0) called at /usr/lib/os-autoinst/backend/svirt.pm line 236
backend::svirt::check_socket('backend::svirt=HASH(0x67fa348)', 'IO::Handle=GLOB(0x6580458)', 0) called at /usr/lib/os-autoinst/backend/baseclass.pm line 208
eval {...} called at /usr/lib/os-autoinst/backend/baseclass.pm line 156
backend::baseclass::run_capture_loop('backend::svirt=HASH(0x67fa348)') called at /usr/lib/os-autoinst/backend/baseclass.pm line 129
backend::baseclass::run('backend::svirt=HASH(0x67fa348)', 6, 9) called at /usr/lib/os-autoinst/backend/driver.pm line 85
backend::driver::start('backend::driver=HASH(0x6d5b2a0)') called at /usr/lib/os-autoinst/backend/driver.pm line 48
backend::driver::new('backend::driver', 'svirt') called at /usr/bin/isotovideo line 212
main::init_backend() called at /usr/bin/isotovideo line 276
09:49:01.2483 26915 Destroying openQA-SUT-2 virtual machine
09:49:01.3508 26915 Connection to root@s390pb.suse.de established
09:49:01.9620 26915 Command's stdout:
Domain openQA-SUT-2 destroyed

26910: EXIT 1
XIO: fatal IO error 11 (Resource temporarily unavailable) on X server ":34037"
after 2680 requests (2680 known processed) with 0 events remaining.
XIO: fatal IO error 11 (Resource temporarily unavailable) on X server ":40683"
after 2747 requests (2747 known processed) with 0 events remaining.
XIO: fatal IO error 11 (Resource temporarily unavailable) on X server ":43057"
after 2747 requests (2747 known processed) with 0 events remaining.
XIO: fatal IO error 11 (Resource temporarily unavailable) on X server ":35703"
after 4525 requests (4525 known processed) with 0 events remaining.
xterm: fatal IO error 11 (Resource temporarily unavailable) or KillClient on X server ":34037"
xterm: xterm: fatal IO error 11 (Resource temporarily unavailable) or KillClient on X server ":35703"
fatal IO error 104 (Connection reset by peer) or KillClient on X server ":35703"


Related issues

Related to openQA Tests - action #12198: [s390][zkvm] bootloader_zkvm fails Resolved 01/06/2016
Related to openQA Tests - action #25638: [sles][functional][s390x] test fails in shutdown: VNC sta... Resolved 28/09/2017 25/10/2017
Related to openQA Tests - action #23650: [sle][functional][ipmi][epic][u] Fix test suite gnome to ... Blocked 20/10/2017
Blocked by openQA Tests - action #19350: [sle][functional][s390x][zkvm][hard] make unavailable ssh... Resolved 24/05/2017 17/01/2018

History

#1 Updated by riafarov over 2 years ago

  • Related to action #12198: [s390][zkvm] bootloader_zkvm fails added

#2 Updated by dgutu over 2 years ago

Happens with Build0440:
https://openqa.suse.de/tests/1021792

#3 Updated by okurz over 2 years ago

  • Assignee set to mgriessmeier
  • Priority changed from Normal to High

build 0450: https://openqa.suse.de/tests/1023268

@mgriessmeier do you think we can take a look to turn the incomplete into fail?

#4 Updated by mgriessmeier over 2 years ago

okurz wrote:

build 0450: https://openqa.suse.de/tests/1023268


@mgriessmeier do you think we can take a look to turn the incomplete into fail?

Yes, we are probably just missing a function call in reboot_gnome, though we also need better error handling for the incomplete...

I plan to work on hpc stuff now, but we can have a look together later today

#6 Updated by SLindoMansilla over 2 years ago

It also happens in another scenario: https://openqa.suse.de/tests/1028107

#7 Updated by okurz over 2 years ago

recent example: https://openqa.suse.de/tests/1033355

I read in the logfile "Please enter passphrase for disk cr_vda2!" so I guess we are just expecting the system to boot up but it's waiting for the password (besides the total non-obviousness)

#8 Updated by okurz over 2 years ago

Sergio, your test is certainly something completely different. It's about IPMI there -> #19958

#9 Updated by mgriessmeier over 2 years ago

okurz wrote:

recent example: https://openqa.suse.de/tests/1033355


I read in the logfile "Please enter passphrase for disk cr_vda2!" so I guess we are just expecting the system to boot up but it's waiting for the password (besides the total non-obviousness)

agreed... it seems like 'unlock_if_encrypted' is not called in the last reboot....

it works in previous reboots
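
The failure mode described in this comment can be sketched roughly as follows. This is only an illustration in Python, not the os-autoinst-distri-opensuse code (which is Perl): `wait_boot`, `send_passphrase` and the list of serial lines are hypothetical stand-ins, the banner line is made-up sample data, and only the prompt text and the start of the banner come from the log above.

```python
import re

# Prompt text as seen in the log; the disk name is captured for the callback.
PASSPHRASE_PROMPT = re.compile(r"Please enter passphrase for disk (\S+?)!")
WELCOME_BANNER = re.compile(r"Welcome to SUSE Linux Enterprise Server")


def wait_boot(serial_lines, send_passphrase):
    """Scan serial output until the welcome banner appears.

    serial_lines: iterable of strings (hypothetical stand-in for the serial console).
    send_passphrase: callable invoked with the disk name when a prompt is seen.
    Returns True if the banner was seen, False if the output ended first.
    """
    for line in serial_lines:
        prompt = PASSPHRASE_PROMPT.search(line)
        if prompt:
            # On a real system the boot stays at this prompt until it is answered.
            send_passphrase(prompt.group(1))
            continue
        if WELCOME_BANNER.search(line):
            return True
    return False


# Sample data only: a prompt followed by an invented banner line.
unlocked = []
booted = wait_boot(
    [
        "Please enter passphrase for disk cr_vda2!",
        "Welcome to SUSE Linux Enterprise Server 12 SP3 (s390x)",
    ],
    unlocked.append,
)
```

If the unlock step is skipped on a real machine, the banner never arrives, the wait times out, and the later console switch is what ends up reporting the error.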

#10 Updated by mgriessmeier over 2 years ago

  • Status changed from New to In Progress

#11 Updated by mgriessmeier over 2 years ago

  • Status changed from In Progress to Resolved

production run successful:
https://openqa.suse.de/tests/1038639

closing as fixed

#12 Updated by okurz over 2 years ago

  • Status changed from Resolved to In Progress

@mgriessmeier can we please brainstorm what to do about better feedback here in case of errors. Look at the subject line "lvm-full-encrypt incomplete test due to socket does not exist. Probably your backend instance could not start or died" which is not related at all to the problem we had: No one entered a password.

#13 Updated by mgriessmeier over 2 years ago

what happened to our brainstorming?
Can I unassign here? close? move over to tools?
Don't really know how to handle this right now

#14 Updated by mgriessmeier over 2 years ago

  • Subject changed from [zkvm][s390] lvm-full-encrypt incomplete test due to socket does not exist. Probably your backend instance could not start or died to [zkvm][s390] incomplete test due to socket does not exist. Probably your backend instance could not start or died
  • Assignee deleted (mgriessmeier)

not working on that right now, maybe someone else wants to have a look

#15 Updated by okurz over 2 years ago

  • Subject changed from [zkvm][s390] incomplete test due to socket does not exist. Probably your backend instance could not start or died to [sle][functional][zkvm][s390] incomplete test due to socket does not exist. Probably your backend instance could not start or died
  • Due date set to 11/10/2017
  • Assignee set to mgriessmeier

after our recent changes to zkvm we are either done here or should revisit the ticket to find out what needs to be done next.

#16 Updated by okurz over 2 years ago

  • Blocked by action #19350: [sle][functional][s390x][zkvm][hard] make unavailable ssh based zkvm consoles more obvious in the backend (was: [consistent] unable to switch to text terminal in consoletest_setup -> bsc#1040606) added

#18 Updated by mgriessmeier over 2 years ago

  • Related to action #25638: [sles][functional][s390x] test fails in shutdown: VNC stall detected, needs to be investigated added

#19 Updated by mgriessmeier over 2 years ago

Work in Progress PR created. Unfortunately it is not as far along as we wanted because more important issues popped up.
@okurz, nsinger: hopefully you can take this as a base to continue further in this sprint.

we turned the die into an Exception, but failed to add a record_info box - though we found a nice way to reproduce the "Socket does not exist" issue consistently (nsinger knows more about that)

https://github.com/os-autoinst/os-autoinst/pull/862
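
The idea of turning the die into an exception can be illustrated with a small sketch. It is written in Python for brevity and is not taken from the PR above; `ConsoleUnavailable`, `select_console`, `run_step` and the `reachable` set are all hypothetical names chosen for this example.

```python
class ConsoleUnavailable(Exception):
    """Raised when a console (for example VNC) cannot be reached."""


def select_console(name, reachable):
    # 'reachable' is a hypothetical set of console names that currently work.
    if name not in reachable:
        raise ConsoleUnavailable(
            f"console '{name}' is not reachable; is the SUT up and past any boot prompts?"
        )
    return name


def run_step(step):
    """Run one test step and map the outcome to a (result, reason) pair."""
    try:
        step()
    except ConsoleUnavailable as err:
        # A typed, expected error: the step fails with a readable reason
        # instead of taking the whole run down.
        return ("failed", str(err))
    return ("passed", "")


result = run_step(lambda: select_console("x11", reachable={"root-console"}))
```

Any other exception type still propagates out of `run_step`, so genuine crashes remain distinguishable from an unreachable console.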

#20 Updated by mgriessmeier over 2 years ago

  • Related to action #23650: [sle][functional][ipmi][epic][u] Fix test suite gnome to work on ipmi 12-SP3 and 15 (WAS: test fails in boot_from_pxe - connection refused trying to ipmi host over ssh?) added

#21 Updated by riafarov over 2 years ago

PR that addresses the review comment so that it can be merged: https://github.com/os-autoinst/os-autoinst/pull/864

#22 Updated by okurz over 2 years ago

  • Due date changed from 11/10/2017 to 25/10/2017

Did not complete in sprint 1. Main reason: a spontaneous packaging training which we were not aware of beforehand. We have the PR which should improve user feedback a lot, and this is definitely possible in the next sprint 2.

#23 Updated by okurz over 2 years ago

  • Target version set to Milestone 11

#24 Updated by okurz over 2 years ago

mgriessmeier is on vacation, unassigning for now.

#25 Updated by okurz over 2 years ago

  • Assignee deleted (mgriessmeier)

#26 Updated by okurz over 2 years ago

recent example from a "minimal_x" scenario failing in "consoletest_finish" on select_console('x11'): https://openqa.suse.de/tests/1216076
Here the switch to 'x11' can not work because that tries to access a VNC server which is most likely not even running on the SUT on SLE15.

I suggest we try to improve the debugging of a failing select_console('x11') in general. We should be able to reproduce this simply by running a synthetic scenario whose first and only command is select_console('x11'), on a machine that most likely has not even been started at this point. Still, the test should then not end up incomplete but fail, ideally with a more helpful error message.

We see this same symptom a lot now in build 303.1 where tests actually proceed way further but then all stop in consoletest_finish on the above error.
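
The synthetic reproduction suggested in this comment can be approximated outside openQA. The sketch below is an assumption-laden Python illustration, not os-autoinst code: `find_closed_port` and `probe_vnc` are hypothetical helpers, and the check is a plain TCP connect rather than a real VNC handshake.

```python
import socket


def find_closed_port():
    """Bind to an ephemeral port, then release it so nothing listens there."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
        s.bind(("127.0.0.1", 0))
        return s.getsockname()[1]


def probe_vnc(host, port, timeout=2.0):
    """Return (ok, message) describing whether a TCP connection succeeds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True, f"VNC endpoint {host}:{port} is accepting connections"
    except OSError as err:
        # Covers connection refused, timeouts and unreachable hosts.
        return False, f"cannot reach VNC endpoint {host}:{port}: {err}"


# Probe a port where no server is listening, like a SUT without a VNC server.
ok, message = probe_vnc("127.0.0.1", find_closed_port())
```

A probe like this, run before the console switch, would let a test report that the VNC endpoint is unreachable instead of ending with the generic "socket does not exist" message.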

#27 Updated by okurz over 2 years ago

  • Priority changed from High to Urgent

#28 Updated by mgriessmeier over 2 years ago

  • Assignee set to mgriessmeier

Re-assigning after vacation
WIP-PR in place: https://github.com/os-autoinst/os-autoinst/pull/870

#29 Updated by okurz over 2 years ago

  • Assignee changed from mgriessmeier to coolo

So, many things happened. mgriessmeier, riafarov, nsinger and I looked into handling the "connection refused" problem by trying to catch the die in the console and handling it in the testapi as a fail instead of an incomplete (backend crash), see the PR by mgriessmeier about this.

coolo will take a look if he can handle the backend part better: https://github.com/os-autoinst/os-autoinst/pull/872

I disabled the s390x x11 tests completely on SLE15 for now with https://github.com/os-autoinst/os-autoinst-distri-opensuse/pull/3736 because of bug bsc#1058071

#30 Updated by mgriessmeier over 2 years ago

  • Status changed from In Progress to Resolved
  • Assignee changed from coolo to mgriessmeier

I consider this fixed since I've verified coolo's patch, and jobs are now failing with a proper error message
http://opeth/tests/5725#step/fail_early/2
