action #48260
Status: closed
[sle][functional][u][s390x][kvm] test fails in reboot_after_installation - "The console isn't responding correctly. Maybe half-open socket?"
Description
Observation
openQA test in scenario sle-15-SP1-Installer-DVD-s390x-gnome@s390x-kvm-sle12 fails in reboot_after_installation.
autoinst-log.txt states:
[2019-02-21T23:50:25.512 CET] [debug] /var/lib/openqa/cache/openqa.suse.de/tests/sle/tests/installation/reboot_after_installation.pm:28 called power_action_utils::power_action
[2019-02-21T23:50:25.513 CET] [debug] <<< testapi::check_shutdown(timeout=180)
[2019-02-21T23:50:25.608 CET] [debug] Connection to root@s390p8.suse.de established
[2019-02-21T23:50:25.687 CET] [debug] Command executed: ! virsh dominfo openQA-SUT-4 | grep -w 'shut off', ret=0
[2019-02-21T23:50:26.778 CET] [debug] Connection to root@s390p8.suse.de established
[2019-02-21T23:50:26.857 CET] [debug] Command executed: ! virsh dominfo openQA-SUT-4 | grep -w 'shut off', ret=0
[2019-02-21T23:50:27.947 CET] [debug] Connection to root@s390p8.suse.de established
[2019-02-21T23:50:28.027 CET] [debug] Command executed: ! virsh dominfo openQA-SUT-4 | grep -w 'shut off', ret=0
[2019-02-21T23:54:15.127 CET] [debug] Backend process died, backend errors are reported below in the following lines:
The console isn't responding correctly. Maybe half-open socket? at /usr/lib/os-autoinst/backend/baseclass.pm line 245.
[2019-02-21T23:54:15.127 CET] [debug] Destroying openQA-SUT-4 virtual machine
[2019-02-21T23:54:15.217 CET] [debug] Connection to root@s390p8.suse.de established
[2019-02-21T23:54:15.296 CET] [debug] Command's stdout:
[2019-02-21T23:54:15.296 CET] [debug] Command's stderr:
error: Failed to destroy domain openQA-SUT-4
error: Requested operation is not valid: domain is not running
[2019-02-21T23:54:15.296 CET] [debug] Command executed: virsh destroy openQA-SUT-4, ret=1
[2019-02-21T23:54:15.387 CET] [debug] Connection to root@s390p8.suse.de established
[2019-02-21T23:54:15.464 CET] [debug] Command's stdout:
Domain openQA-SUT-4 has been undefined
[2019-02-21T23:54:15.464 CET] [debug] Command executed: virsh undefine --snapshots-metadata openQA-SUT-4, ret=0
Reproducible
Seems to fail more and more often lately.
Expected result
Last good: 165.1
Further details
Always latest result in this scenario: latest
Updated by okurz almost 6 years ago
- Related to action #46964: [functional][u][s390x] test fails in the middle of execution (not installation) as incomplete with "half-open socket?" – connection to machine vanished? added
Updated by okurz almost 6 years ago
- Related to action #46394: [sle][s390x][spvm][kvm][sporadic] test fails in various modules to login to svirt console (or system is not up yet) added
Updated by mgriessmeier almost 6 years ago
imho this is related to the worker issues we have on openqaworker2 and grenache (not speaking about the network-related ones),
but tbh: I'm too tired of implementing shi**y workarounds to address those worker issues...
Updated by okurz almost 6 years ago
sure, we should focus on making the errors more obvious and clean, fewer workarounds :)
Updated by okurz almost 6 years ago
- Assignee set to coolo
@coolo I have the suspicion that your change https://gitlab.suse.de/openqa/salt-pillars-openqa/merge_requests/127/ has caused the issue reported above as well as other problems (see the linked tickets), e.g. when we fail to log in over ssh in time. Would you mind if we temporarily move the s390x-kvm remote workers away from grenache?
Updated by okurz almost 6 years ago
- Blocked by action #48170: [functional][u][svirt][xen] SUTs are not able to reach SUSE network, e.g. failed to reach proxy.scc.suse.de added
Updated by okurz almost 6 years ago
- Has duplicate action #48284: [sle][migration][SLE15SP1]test fails in reboot_after_installation - Backend process died added
Updated by okurz almost 6 years ago
- Related to action #33202: [sle][functional][s390x][zkvm][u][hard] test fails in boot_to_desktop - still insufficient error reporting, black screen with mouse cursor - we all hate it (was: I hate it) added
Updated by okurz almost 6 years ago
- Subject changed from [sle][functional][u][s390x][kvm] test fails in reboot_after_installation - "The console isn't responding correctly. Maybe half-open socket?" to [sle][functional][u][s390x][kvm] test fails in reboot_after_installation - "The console isn't responding correctly. Maybe half-open socket?" - we all hate it ;)
- Status changed from Workable to In Progress
- Assignee changed from coolo to okurz
What we can see in
[2019-02-21T23:50:28.027 CET] [debug] Command executed: ! virsh dominfo openQA-SUT-4 | grep -w 'shut off', ret=0
[2019-02-21T23:54:15.127 CET] [debug] Backend process died, backend errors are reported below in the following lines:
is that there is a gap of almost four minutes (23:50:28 to 23:54:15) between the last "Command executed:" output (consoles/sshVirtsh.pm::run_cmd) and the "Backend process died" message.
mgriessmeier and I assumed from consoles/sshVirtsh.pm::run_cmd that the messages should appear in this order:
Connection established
Command executed: …
Command's stdout
However, what we actually see is that for the same timestamp "Command's stdout" comes first and "Command executed" second, which we cannot explain yet.
Our hypothesis is that all the similar problems we see with the ssh connections from the worker to the hypervisor hosts or to the SUTs share the same root cause: the connection is lost in between and not handled properly by our use of Net::SSH2. Most likely an internal ssh timeout kicks in, even though https://metacpan.org/pod/Net::SSH2#timeout-(-timeout_ms-) explains that by default Net::SSH2 should apply no timeout. The error would most likely show up within the check_shutdown method of reboot_after_installation, because every call of run_cmd establishes a new ssh connection in a loop, i.e. many connection (attempts) within a short time.
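For reference, a minimal sketch of setting the Net::SSH2 timeout documented at the link above (the value is an arbitrary example, not what our backend uses):

use strict;
use warnings;
use Net::SSH2;

my $ssh = Net::SSH2->new;
# Per the linked Net::SSH2 docs: a global timeout in milliseconds applied
# to blocking library calls; 0 (the default) disables it, which matches
# "no timeout by default".
$ssh->timeout(2 * 60 * 1000);    # example: give up after two minutes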
Idea for an experiment: build a minimal reproducer that performs many ssh connection attempts within a short time from a Perl script, as sketched below.
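A minimal sketch of such a reproducer, mirroring the run_cmd flow; host and command are taken from the log above, and authentication via the running ssh agent is an assumption:

#!/usr/bin/perl
use strict;
use warnings;
use Net::SSH2;

my $host = 's390p8.suse.de';    # hypervisor host from the log above
my $cmd  = "! virsh dominfo openQA-SUT-4 | grep -w 'shut off'";

for my $i (1 .. 100) {
    # one fresh connection per command, like check_shutdown does via run_cmd
    my $ssh = Net::SSH2->new;
    $ssh->connect($host)     or die "connect failed on run $i";
    $ssh->auth_agent('root') or die "auth failed on run $i";    # assumption: ssh agent holds the key
    my $chan = $ssh->channel();
    $chan->exec($cmd);
    $chan->send_eof;
    printf "run %3d: ret=%d\n", $i, $chan->exit_status();
    $chan->close;
    $ssh->disconnect;
}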
Updated by okurz almost 6 years ago
We have understood why the output appears in the "wrong order": there are actually two methods called "run_cmd", the second one added in https://github.com/os-autoinst/os-autoinst/pull/599 within backend/svirt. I guess we should first reduce some of that duplication to prevent the confusion; a sketch follows below. It looks like mnowak created an explicit new ssh connection within backend/svirt to allow the use of snapshots and reverting to them in general.
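A sketch of how that duplication could be reduced, assuming one low-level helper that runs a command over a given connection plus a thin wrapper for the "fresh connection per command" case (helper names are hypothetical):

# hypothetical shared low-level helper, essentially today's backend/svirt run_cmd
sub run_cmd_over_connection {
    my ($ssh, $cmd) = @_;
    my $chan = $ssh->channel();
    $chan->exec($cmd);
    get_ssh_output($chan);
    $chan->send_eof;
    my $ret = $chan->exit_status();
    bmwqemu::diag "Command executed: $cmd, ret=$ret";
    $chan->close();
    return $ret;
}

# hypothetical wrapper covering the consoles/sshVirtsh.pm behaviour of
# establishing a new ssh connection for every single command
sub run_cmd_with_fresh_connection {
    my ($self, $cmd) = @_;
    my $ssh = $self->new_ssh_connection;    # see backend/baseclass.pm in the diff below
    my $ret = run_cmd_over_connection($ssh, $cmd);
    $ssh->disconnect;
    return $ret;
}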
Updated by okurz almost 6 years ago
Using the same approach as run_cmd from backend/svirt in a script running on grenache-1.qa and connecting to s390p8 shows no problems establishing 100 connections in a row.
Updated by okurz almost 6 years ago
I managed to reproduce the problem locally with:
openqa-clone-job --skip-chained-deps --host http://lord.arch --from https://openqa.suse.de 2488101 _SKIP_POST_FAIL_HOOKS=1 ROOTONLY=1 INCLUDE_MODULES=bootloader_zkvm,welcome,accept_license,skip_registration,addon_products_sle,partitioning,partitioning_finish,installer_timezone,user_settings,user_settings_root,installation_overview,start_install,await_install,reboot_after_installation,reconnect_mgmt_console
Updated by okurz almost 6 years ago
Trying with more debugging plus an internal timeout on ssh commands in os-autoinst:
diff --git a/backend/baseclass.pm b/backend/baseclass.pm
index ddddcae2..aa41a6c6 100644
--- a/backend/baseclass.pm
+++ b/backend/baseclass.pm
@@ -1168,7 +1168,7 @@ sub new_ssh_connection {
my ($self, %args) = @_;
$args{username} ||= 'root';
- my $ssh = Net::SSH2->new;
+ my $ssh = Net::SSH2->new(timeout => 2 * 60 * 1000, trace => -1, debug => 1);
# Retry 5 times, in case of the guest is not running yet
my $counter = 5;
diff --git a/backend/svirt.pm b/backend/svirt.pm
index f4211ec6..7482276b 100644
--- a/backend/svirt.pm
+++ b/backend/svirt.pm
@@ -115,12 +115,14 @@ sub run_cmd {
my ($ssh, $cmd) = @_;
my $chan = $ssh->channel();
+ bmwqemu::diag "run_cmd exec: $cmd …";
$chan->exec($cmd);
get_ssh_output($chan);
$chan->send_eof;
my $ret = $chan->exit_status();
bmwqemu::diag "Command executed: $cmd, ret=$ret";
$chan->close();
+ bmwqemu::diag "channel closed";
return $ret;
}
@@ -151,6 +153,7 @@ sub can_handle {
sub is_shutdown {
my ($self) = @_;
+ bmwqemu::diag "is_shutdown: begin";
my $vmname = $self->console('svirt')->name;
my $rsp;
if (check_var('VIRSH_VMM_FAMILY', 'hyperv')) {
@@ -160,6 +163,7 @@ sub is_shutdown {
my $libvirt_connector = get_var('VMWARE_REMOTE_VMM', '');
$rsp = $self->run_ssh_cmd("! virsh $libvirt_connector dominfo $vmname | grep -w 'shut off'");
}
+ bmwqemu::diag "is_shutdown: end";
return $rsp;
}
Tried locally with just isotovideo and that seemed to pass until after the reboot, but I could not find any of the diag messages I had put in. Let's see in openQA:
Updated by okurz almost 6 years ago
So far I could not find out what the problem is.
[25/02/2019 13:24:43] <okurz> mgriessmeier: could you take a look into http://lord.arch/tests/2036/file/autoinst-log.txt ? Search for "Backend process died". I understand from the additional debug output I put in that the "installation" console is considered stalled and kills the backend although the installation scenario previously was called with "disable_vnc_stalls". Do you have a good idea if we need to do something else with the installation console before calling check_shutdown?
[25/02/2019 13:37:21] <mgriessmeier> okurz: hmm I don't see how you come to the conclusion why the installation console is stalled
[25/02/2019 13:37:33] <mgriessmeier> I only find _vnc_stalled' => 0,
[25/02/2019 13:38:27] <okurz> huh, that's a good point. Why has the backend died on "installation" console then?
[25/02/2019 13:40:47] <okurz> it's not about stall detection, more like the opposite: the connection is down nevertheless
[25/02/2019 13:41:09] <mgriessmeier> mhh yeah
[25/02/2019 13:42:30] <mgriessmeier> okurz: did you actually conduct the experiment with the ssh script?
[25/02/2019 13:43:17] <mgriessmeier> because our first assumption was that it's related to the LPAR and not to a console
[25/02/2019 13:44:31] <okurz> mgriessmeier: yeah, the script is fine. I can connect 100 times in a row without problems from grenache-1.qa
[25/02/2019 13:45:06] <mgriessmeier> with more complex commands than 'ls' ?
[25/02/2019 13:45:15] <okurz> the dominfo one
[25/02/2019 13:45:21] <mgriessmeier> kay
[25/02/2019 13:46:51] <mgriessmeier> mnowak: did you have any issues recently with run_cmd and checking if the SUT is shut down?
[25/02/2019 13:47:36] <mnowak> mgriessmeier, not for some time. any failed job?
[25/02/2019 13:47:53] <mgriessmeier> all the incompletes on s390x :/
[25/02/2019 13:47:58] <mgriessmeier> the half-open socket thingy
[25/02/2019 13:48:10] <mgriessmeier> http://lord.arch/tests/2036/file/autoinst-log.txt with debug output for example
[25/02/2019 13:51:55] <mnowak> mgriessmeier, this used to happen before disable_vnc_stalls was introduced, but that was triggered..
[25/02/2019 13:53:23] <okurz> mnowak: could it be your commit 1ca240982 introduced the problem?
Trying with and without the revert
env openqa-clone-set https://openqa.suse.de/tests/2488101 okurz_investigation_poo48260_current_master _SKIP_POST_FAIL_HOOKS=1 ROOTONLY=1 INCLUDE_MODULES=bootloader_zkvm,welcome,accept_license,skip_registration,addon_products_sle,partitioning,partitioning_finish,installer_timezone,user_settings,user_settings_root,installation_overview,start_install,await_install,reboot_after_installation,reconnect_mgmt_console
-> https://openqa.suse.de/tests/overview?distri=sle&version=15-SP1&build=okurz_investigation_poo48260_current_master on current master
openqa-clone-custom-git-refspec https://github.com/os-autoinst/os-autoinst-distri-opensuse/pull/6890 https://openqa.suse.de/tests/2488101
Created job #2495939: sle-15-SP1-Installer-DVD-s390x-Build175.4-skip_registration@s390x-kvm-sle12 -> https://openqa.suse.de/t2495939
env build=okurz_investigation_poo48260_current_master openqa-clone-set https://openqa.suse.de/tests/2495939 okurz_investigation_poo48260_reverted1ca24098 _SKIP_POST_FAIL_HOOKS=1 ROOTONLY=1 INCLUDE_MODULES=bootloader_zkvm,welcome,accept_license,skip_registration,addon_products_sle,partitioning,partitioning_finish,installer_timezone,user_settings,user_settings_root,installation_overview,start_install,await_install,reboot_after_installation,reconnect_mgmt_console
same build.
Updated by okurz almost 6 years ago
- Status changed from In Progress to Feedback
That is a pretty convincing result: "current master" fails 10/10, while with 1ca24098 reverted 0/10 fail. https://github.com/os-autoinst/os-autoinst-distri-opensuse/pull/6890 with that part reverted has been merged.
Updated by okurz almost 6 years ago
Looks better so far. We should still improve the logging output, e.g. what I have running on lord.arch
Updated by okurz almost 6 years ago
- Status changed from Feedback to Resolved
PR merged. As we have not seen further related failures recently, I think we are done here. I still have plans to improve the user feedback in the future, e.g. by turning the incomplete into a fail with a hint to the test reviewer/developer.
Updated by SLindoMansilla over 5 years ago
- Related to action #49958: [functional][u][s390x] test fails in reboot_after_installation - Failed to destroy domain openQA-SUT- added
Updated by szarate almost 4 years ago
- Subject changed from [sle][functional][u][s390x][kvm] test fails in reboot_after_installation - "The console isn't responding correctly. Maybe half-open socket?" - we all hate it ;) to [sle][functional][u][s390x][kvm] test fails in reboot_after_installation - "The console isn't responding correctly. Maybe half-open socket?"