action #48260

[sle][functional][u][s390x][kvm] test fails in reboot_after_installation - "The console isn't responding correctly. Maybe half-open socket?" - we all hate it ;)

Added by okurz about 1 year ago. Updated 12 months ago.

Status:ResolvedStart date:21/02/2019
Priority:UrgentDue date:
Assignee:okurz% Done:

0%

Category:Bugs in existing tests
Target version:SUSE QA tests - Milestone 23
Difficulty:
Duration:

Description

Observation

openQA test in scenario sle-15-SP1-Installer-DVD-s390x-gnome@s390x-kvm-sle12 fails in
reboot_after_installation

autoinst-log.txt states:

[2019-02-21T23:50:25.512 CET] [debug] /var/lib/openqa/cache/openqa.suse.de/tests/sle/tests/installation/reboot_after_installation.pm:28 called power_action_utils::power_action
[2019-02-21T23:50:25.513 CET] [debug] <<< testapi::check_shutdown(timeout=180)
[2019-02-21T23:50:25.608 CET] [debug] Connection to root@s390p8.suse.de established
[2019-02-21T23:50:25.687 CET] [debug] Command executed: ! virsh  dominfo openQA-SUT-4 | grep -w 'shut off', ret=0
[2019-02-21T23:50:26.778 CET] [debug] Connection to root@s390p8.suse.de established
[2019-02-21T23:50:26.857 CET] [debug] Command executed: ! virsh  dominfo openQA-SUT-4 | grep -w 'shut off', ret=0
%@[2019-02-21T23:50:27.947 CET] [debug] Connection to root@s390p8.suse.de established
[2019-02-21T23:50:28.027 CET] [debug] Command executed: ! virsh  dominfo openQA-SUT-4 | grep -w 'shut off', ret=0

[2019-02-21T23:54:15.127 CET] [debug] Backend process died, backend errors are reported below in the following lines:
The console isn't responding correctly. Maybe half-open socket? at /usr/lib/os-autoinst/backend/baseclass.pm line 245.

[2019-02-21T23:54:15.127 CET] [debug] Destroying openQA-SUT-4 virtual machine
[2019-02-21T23:54:15.217 CET] [debug] Connection to root@s390p8.suse.de established
[2019-02-21T23:54:15.296 CET] [debug] Command's stdout:


[2019-02-21T23:54:15.296 CET] [debug] Command's stderr:
error: Failed to destroy domain openQA-SUT-4
error: Requested operation is not valid: domain is not running

[2019-02-21T23:54:15.296 CET] [debug] Command executed: virsh  destroy openQA-SUT-4, ret=1
[2019-02-21T23:54:15.387 CET] [debug] Connection to root@s390p8.suse.de established
[2019-02-21T23:54:15.464 CET] [debug] Command's stdout:
Domain openQA-SUT-4 has been undefined


[2019-02-21T23:54:15.464 CET] [debug] Command executed: virsh  undefine --snapshots-metadata openQA-SUT-4, ret=0
last frame

Reproducible

Fails what seems like more and more lately

Expected result

Last good: 165.1

Further details

Always latest result in this scenario: latest


Related issues

Related to openQA Tests - action #46964: [functional][u][s390x] test fails in the middle of execut... Resolved 01/02/2019
Related to openQA Tests - action #46394: [sle][functional][u][s390x][spvm][kvm][sporadic] test fai... Workable
Related to openQA Tests - action #33202: [sle][functional][s390x][zkvm][u][hard] test fails in boo... Resolved 13/03/2018 14/08/2018
Related to openQA Tests - action #49958: [functional][u][s390x] test fails in reboot_after_install... Rejected 03/04/2019
Duplicated by openQA Tests - action #48284: [sle][migration][SLE15SP1]test fails in reboot_after_inst... Rejected 22/02/2019
Blocked by openQA Tests - action #48170: [functional][u][svirt][xen] SUTs are not able to reach SU... Resolved 21/02/2019

History

#1 Updated by okurz about 1 year ago

  • Related to action #46964: [functional][u][s390x] test fails in the middle of execution (not installation) as incomplete with "half-open socket?" – connection to machine vanished? added

#2 Updated by okurz about 1 year ago

  • Related to action #46394: [sle][functional][u][s390x][spvm][kvm][sporadic] test fails in various modules to login to svirt console (or system is not up yet) added

#3 Updated by okurz about 1 year ago

  • Priority changed from High to Urgent

#4 Updated by mgriessmeier about 1 year ago

imho this is related to the worker issues we have on openqaworker2 and grenache (not speaking about the network related ones)
but tbh: I'm too tired of implementing shi**y workarounds to address those worker issues...

#5 Updated by okurz about 1 year ago

sure, we should focus on making the errors more obvious and clean, less workarounds :)

#6 Updated by okurz about 1 year ago

  • Assignee set to coolo

@coolo I have the suspicion that your change https://gitlab.suse.de/openqa/salt-pillars-openqa/merge_requests/127/ has caused the issue as reported above as well as other problems – see linked tickets, e.g. – when we fail to login in time over ssh. Would you mind if we temporarily move away the s390x-kvm remote workers from grenache?

#7 Updated by okurz about 1 year ago

  • Blocked by action #48170: [functional][u][svirt][xen] SUTs are not able to reach SUSE network, e.g. failed to reach proxy.scc.suse.de added

#8 Updated by okurz about 1 year ago

  • Duplicated by action #48284: [sle][migration][SLE15SP1]test fails in reboot_after_installation - Backend process died added

#9 Updated by okurz about 1 year ago

  • Related to action #33202: [sle][functional][s390x][zkvm][u][hard] test fails in boot_to_desktop - still insufficient error reporting, black screen with mouse cursor - we all hate it (was: I hate it) added

#10 Updated by okurz about 1 year ago

  • Subject changed from [sle][functional][u][s390x][kvm] test fails in reboot_after_installation - "The console isn't responding correctly. Maybe half-open socket?" to [sle][functional][u][s390x][kvm] test fails in reboot_after_installation - "The console isn't responding correctly. Maybe half-open socket?" - we all hate it ;)
  • Status changed from Workable to In Progress
  • Assignee changed from coolo to okurz

What we can see in

[2019-02-21T23:50:28.027 CET] [debug] Command executed: ! virsh  dominfo openQA-SUT-4 | grep -w 'shut off', ret=0

[2019-02-21T23:54:15.127 CET] [debug] Backend process died, backend errors are reported below in the following lines:

is that there is a 3 minute period between the last output of "Command executed:" (consoles/sshVirtsh.pm::run_cmd) and the message about "Backend process died".

mgriessmeier and me assumed from consoles/sshVirtsh.pm::run_cmd that the messages should appear in this order:

Connection established
Command executed: …
Command's stdout

however what we see is that for the same timestamp first "stdout" and then "executed" which we can not explain yet.

Our hypothesis is that all the similar problems we see with the ssh connections from the worker to the hypervisor hosts or the SUTs might relate to the same root cause that the connection is lost in between and not handled properly with our use of Net::SSH2 with most likely an internal ssh timeout appearing as https://metacpan.org/pod/Net::SSH2#timeout-(-timeout_ms-) explains that by default there should be no timeout by Net::SSH2 . The error would be most likely to appear within the check_shutdown method of reboot_after_installation because for every call of run_cmd we establish a new ssh connection in a loop so many connection (attempts) in short time.

Idea for experiment, have minimum reproducer using multiple ssh connection attempts within short time from a perl script.

#11 Updated by okurz about 1 year ago

We have understood why the output looks like in the "wrong order". There are actually two methods "run_cmd", the second one added in https://github.com/os-autoinst/os-autoinst/pull/599 within backend/svirt. I guess we should reduce some of the duplication from there first and prevent that confusion. It looks like mnowak creates an explicit new ssh connection within backend/svirt to allow in general the use of snapshots and reverting to them.

#12 Updated by okurz about 1 year ago

Using the same approach as the run_cmd from backend/svirt running in a script from grenache-1.qa connecting to s390p8 does not show any problems to establish 100 connections in a row.

#13 Updated by okurz about 1 year ago

I managed to reproduce the problem locally with

openqa-clone-job --skip-chained-deps --host http://lord.arch --from https://openqa.suse.de 2488101 _SKIP_POST_FAIL_HOOKS=1 ROOTONLY=1 INCLUDE_MODULES=bootloader_zkvm,welcome,accept_license,skip_registration,addon_products_sle,partitioning,partitioning_finish,installer_timezone,user_settings,user_settings_root,installation_overview,start_install,await_install,reboot_after_installation,reconnect_mgmt_console

-> http://lord.arch/tests/2031

#14 Updated by okurz 12 months ago

Trying with more debugging plus internal timeout on ssh commands in os-autoinst:

diff --git a/backend/baseclass.pm b/backend/baseclass.pm
index ddddcae2..aa41a6c6 100644
--- a/backend/baseclass.pm
+++ b/backend/baseclass.pm
@@ -1168,7 +1168,7 @@ sub new_ssh_connection {
     my ($self, %args) = @_;
     $args{username} ||= 'root';

-    my $ssh = Net::SSH2->new;
+    my $ssh = Net::SSH2->new(timeout => 2 * 60 * 1000, trace => -1, debug => 1);

     # Retry 5 times, in case of the guest is not running yet
     my $counter = 5;
diff --git a/backend/svirt.pm b/backend/svirt.pm
index f4211ec6..7482276b 100644
--- a/backend/svirt.pm
+++ b/backend/svirt.pm
@@ -115,12 +115,14 @@ sub run_cmd {
     my ($ssh, $cmd) = @_;

     my $chan = $ssh->channel();
+    bmwqemu::diag "run_cmd exec: $cmd …";
     $chan->exec($cmd);
     get_ssh_output($chan);
     $chan->send_eof;
     my $ret = $chan->exit_status();
     bmwqemu::diag "Command executed: $cmd, ret=$ret";
     $chan->close();
+    bmwqemu::diag "channel closed";
     return $ret;
 }   

@@ -151,6 +153,7 @@ sub can_handle {

 sub is_shutdown {
     my ($self) = @_;
+    bmwqemu::diag "is_shutdown: begin";
     my $vmname = $self->console('svirt')->name;
     my $rsp;
     if (check_var('VIRSH_VMM_FAMILY', 'hyperv')) {
@@ -160,6 +163,7 @@ sub is_shutdown {
         my $libvirt_connector = get_var('VMWARE_REMOTE_VMM', '');
         $rsp = $self->run_ssh_cmd("! virsh $libvirt_connector dominfo $vmname | grep -w 'shut off'");
     }
+    bmwqemu::diag "is_shutdown: end";
     return $rsp;
 }

Tried locally with just isotovideo and that seemed to pass until after the reboot but I could not find any of the diag messages I put in. Let's see in openQA:

-> http://lord.arch/tests/2034

#15 Updated by okurz 12 months ago

So far I could not find out what the problem is.

[25/02/2019 13:24:43] <okurz> mgriessmeier: could you take a look into http://lord.arch/tests/2036/file/autoinst-log.txt ? Search for "Backend process died". I understand from the additional debug output I put in that the "installation" console is considered stalled and kills the backend although the installation scenario previously was called with "disable_vnc_stalls". Do you have a good idea if we need to do something else with the installation console before calling check_shutdown?
[25/02/2019 13:37:21] <mgriessmeier> okurz: hmm I don't see how you come to the conclusion why the installation console is stalled
[25/02/2019 13:37:33] <mgriessmeier> I only find _vnc_stalled' => 0,
[25/02/2019 13:38:27] <okurz> huh, that's a good point. Why has the backend died on "installation" console then?
[25/02/2019 13:40:47] <okurz> it's not about stall detection, more like the opposite: the connection is down nevertheless
[25/02/2019 13:41:09] <mgriessmeier> mhh yeah
[25/02/2019 13:42:30] <mgriessmeier> okurz: did you actually conduct the expiriment with the ssh script?
[25/02/2019 13:43:17] <mgriessmeier> because our first assumption was that it's related to the LPAR and not to a console
[25/02/2019 13:44:31] <okurz> mgriessmeier: yeah, the script is fine. I can connect 100 times in a row without problems from grenache-1.qa
[25/02/2019 13:45:06] <mgriessmeier> with more complex commands than 'ls' ?
[25/02/2019 13:45:15] <okurz> the dominfo one
[25/02/2019 13:45:21] <mgriessmeier> kay
[25/02/2019 13:46:51] <mgriessmeier> mnowak: did you have any issues recently with run_cmd and checking if the SUT is shut down?
[25/02/2019 13:47:36] <mnowak> mgriessmeier, not for some time. any failed job?
[25/02/2019 13:47:53] <mgriessmeier> all the incompletes on s390x :/
[25/02/2019 13:47:58] <mgriessmeier> the half-open socket thingy
[25/02/2019 13:48:10] <mgriessmeier> http://lord.arch/tests/2036/file/autoinst-log.txt  with debug output for example
[25/02/2019 13:51:55] <mnowak> mgriessmeier, this used to happen before disable_vnc_stalls was introduced, but that was triggered..
[25/02/2019 13:53:23] <okurz> mnowak: could it be your commit 1ca240982 introduced the problem?

Trying with and without the revert

env openqa-clone-set https://openqa.suse.de/tests/2488101 okurz_investigation_poo48260_current_master _SKIP_POST_FAIL_HOOKS=1 ROOTONLY=1 INCLUDE_MODULES=bootloader_zkvm,welcome,accept_license,skip_registration,addon_products_sle,partitioning,partitioning_finish,installer_timezone,user_settings,user_settings_root,installation_overview,start_install,await_install,reboot_after_installation,reconnect_mgmt_console

-> https://openqa.suse.de/tests/overview?distri=sle&version=15-SP1&build=okurz_investigation_poo48260_current_master on current master

openqa-clone-custom-git-refspec https://github.com/os-autoinst/os-autoinst-distri-opensuse/pull/6890 https://openqa.suse.de/tests/2488101

Created job #2495939: sle-15-SP1-Installer-DVD-s390x-Build175.4-skip_registration@s390x-kvm-sle12 -> https://openqa.suse.de/t2495939

env build=okurz_investigation_poo48260_current_master openqa-clone-set https://openqa.suse.de/tests/2495939 okurz_investigation_poo48260_reverted1ca24098 _SKIP_POST_FAIL_HOOKS=1 ROOTONLY=1 INCLUDE_MODULES=bootloader_zkvm,welcome,accept_license,skip_registration,addon_products_sle,partitioning,partitioning_finish,installer_timezone,user_settings,user_settings_root,installation_overview,start_install,await_install,reboot_after_installation,reconnect_mgmt_console

same build.

#16 Updated by okurz 12 months ago

  • Status changed from In Progress to Feedback

That is a pretty convincing result. "current master" fails 10/10, reverted 1ca24098 is 0/10 failed. https://github.com/os-autoinst/os-autoinst-distri-opensuse/pull/6890 with that part reverted, merged.

#18 Updated by okurz 12 months ago

Looks better so far. We should still improve the logging output, e.g. what I have running on lord.arch

#20 Updated by okurz 12 months ago

  • Status changed from Feedback to Resolved

PR merged. As we should not have seen further related failures recently I think we are done here. I still have plans to improve the user feedback e.g. with turning the incomplete into a fail with a hint to the test reviewer/developer in the future.

#21 Updated by SLindoMansilla 11 months ago

  • Related to action #49958: [functional][u][s390x] test fails in reboot_after_installation - Failed to destroy domain openQA-SUT- added

Also available in: Atom PDF