action #44771

[tools][svirt] Can't call method "exec" on an undefined value

Added by michalnowak about 1 year ago. Updated 11 months ago.

Status:ResolvedStart date:05/12/2018
Priority:NormalDue date:
Assignee:mkittler% Done:

0%

Category:Concrete Bugs
Target version:Current Sprint
Difficulty:
Duration:

Description

Many svirt Hyper-V (and VMware I run locally) jobs with console/integration_services test fail with:

Can't call method "exec" on an undefined value at /usr/lib/os-autoinst/consoles/sshVirtsh.pm line 583.

First such a failure on OSD is https://openqa.suse.de/tests/2277811#step/integration_services/6 from 23 Nov 2018.

The problem started to be more apparent on OSD since Dec 1st 19:00 UTC. Tests which started before that did not exhibit the issue (https://openqa.suse.de/tests/2296473#step/integration_services/16), those started after, do. (I don't know why it's so, maybe it's not relevant.)

The more substantial change in svirt between 20181115 and 20181129 deployments is https://github.com/os-autoinst/os-autoinst/pull/1048/.

The sub in question is https://github.com/os-autoinst/os-autoinst/blob/master/consoles/sshVirtsh.pm#L576.

I noticed that the issue exhibits only when the console/integration_services test is scheduled after the VM is restarted (due to installation finished), when I schedule the test soon after installation/bootloader test the issue does not happen. Console reset() in action?


Related issues

Related to openQA Tests - action #44840: [hyperv] Use of uninitialized value in lib/utils integrat... Resolved 06/12/2018

History

#1 Updated by okurz about 1 year ago

hm, @michalnowak you used the tag "[svirt]" but no further "team"-related tag in the subject line. Do you plan to pick it up yourself eventually or expect a team to track it?

#2 Updated by michalnowak about 1 year ago

I expect the team to take it as I ran out of ideas on an obvious fix.

#3 Updated by mkittler about 1 year ago

Do you already attempt to use the new serial console introduced in my PR? (Seems like that's not the case.)

So only Hyper-V and VMWare is affected and not qemu? So far I only used qemu to run tests locally. How would I reproduce this locally?

Console reset() in action?

Some debug printing couuld answer that question.


os-autoinst log around https://openqa.suse.de/tests/2296473#step/integration_services/16:

[2018-12-01T19:12:10.960 CET] [debug] <<< testapi::wait_serial(timeout=90, regexp='SCRIPT_FINISHEDGKlfe-\\d+-', record_output=1)
GKlfe

10.162.30.962620:113:80c0:80a0:10:162:29:9da9SCRIPT_FINISHEDGKlfe-0-

[2018-12-01T19:12:13.072 CET] [debug] >>> testapi::wait_serial: SCRIPT_FINISHEDGKlfe-\d+-: ok
[2018-12-01T19:12:13.073 CET] [debug] /var/lib/openqa/cache/openqa.suse.de/tests/sle/tests/console/integration_services.pm:21 called utils::integration_services_check
[2018-12-01T19:12:13.073 CET] [debug] <<< testapi::record_info(title='IP (host)', output=undef, result='ok')
Use of uninitialized value $output in print at /usr/lib/os-autoinst/basetest.pm line 433.
[2018-12-01T19:12:13.073 CET] [debug] /var/lib/openqa/cache/openqa.suse.de/tests/sle/tests/console/integration_services.pm:21 called utils::integration_services_check
[2018-12-01T19:12:13.073 CET] [debug] <<< testapi::record_info(title='IP (guest)', output='10.162.30.962620:113:80c0:80a0:10:162:29:9da9', result='ok')
Use of uninitialized value $ips_host_pov in string ne at /var/lib/openqa/cache/openqa.suse.de/tests/sle/lib/utils.pm line 171.
Use of uninitialized value $ips_host_pov in concatenation (.) or string at /var/lib/openqa/cache/openqa.suse.de/tests/sle/lib/utils.pm line 171.
[2018-12-01T19:12:13.188 CET] [debug] ips_host_pov=<> ips_guest_pov=<10.162.30.962620:113:80c0:80a0:10:162:29:9da9> at /var/lib/openqa/cache/openqa.suse.de/tests/sle/lib/utils.pm line 171.

[2018-12-01T19:12:13.188 CET] [debug] <<< testapi::select_console(testapi_console='log-console')
[2018-12-01T19:12:13.399 CET] [debug] activate_console, console: log-console, type: console
[2018-12-01T19:12:13.399 CET] [debug] <<< testapi::check_screen(mustmatch='any-console', timeout=10)
[2018-12-01T19:12:13.568 CET] [debug] >>> testapi::_handle_found_needle: found any-console-20180914, similarity 1.00 @ 55/754
[2018-12-01T19:12:13.568 CET] [debug] <<< testapi::assert_screen(mustmatch=[
  'tty5-selected',
  'text-logged-in-root'
], timeout=60)
[2018-12-01T19:12:14.652 CET] [debug] >>> testapi::_handle_found_needle: found tty5-selected-20180511, similarity 1.00 @ 47/113

#4 Updated by michalnowak about 1 year ago

  • Related to action #44840: [hyperv] Use of uninitialized value in lib/utils integration_services_check_ip() added

#5 Updated by michalnowak about 1 year ago

mkittler wrote:

Do you already attempt to use the new serial console introduced in my PR? (Seems like that's not the case.)

Nope.

So only Hyper-V and VMWare is affected and not qemu? So far I only used qemu to run tests locally. How would I reproduce this locally?

qemu won't get to the code-path, it's related only to svirt backend. I think you can execute the code-path on KVM@svirt or Xen@svirt though I haven't seen the problem there as integration_services_check_ip() is not executed in this context. Just run console('svirt')->get_cmd_output() after installation, I guess? Note that it's not 100% reproducible. If it's not reproducible on KVM or Xen at all, I can help you set up Hyper-V@svirt.

os-autoinst log around https://openqa.suse.de/tests/2296473#step/integration_services/16:

I think it comes out from the 'exec' problem , but shouldn't be related to the original issue.

#6 Updated by okurz about 1 year ago

  • Subject changed from [svirt] Can't call method "exec" on an undefined value to [tools][svirt] Can't call method "exec" on an undefined value

"tools"-team then I think?

#7 Updated by michalnowak about 1 year ago

I'd like this to be investigated by the tools team as it failed six times - majority of cases - in Virtualization groups (the Beta1 build - 125.1) and affects validation.

#8 Updated by coolo about 1 year ago

  • Target version set to Ready

We should have enough staffing for that in June.

#9 Updated by okurz about 1 year ago

This is an autogenerated message for openQA integration by the openqa_review script:

This bug is still referenced in a failing openQA test: textmode@svirt-hyperv2012r2-uefi
https://openqa.suse.de/tests/2341720

#10 Updated by mkittler about 1 year ago

The relevant line is now 643:

sub get_cmd_output {
    my ($self, $cmd, $args) = @_;

    my $wantarray = $args->{wantarray};
    my $domain    = $args->{domain} // 'ssh';
    my $chan      = $self->{$domain}->channel();
->  $chan->exec($cmd);
    bmwqemu::diag "Command executed: $cmd";
    my @cmd_output = get_ssh_output($chan);
    $chan->send_eof;
    $chan->close();
    return $wantarray ? \@cmd_output : $cmd_output[0];
}

Obviously the Net::SSH2 object exists but the ->channel() method returned undef instead of a Net::SSH2::Channel object. So "Can't call method "exec" on an undefined value" actually tells us that we could not create an SSH channel. From the documentation of Net::SSH2 it is hard to tell whether returning undef is intended and how one would obtain further information about the error.

The ->channel() method is implemented in C (see https://metacpan.org/source/SALVA/Net-SSH2-0.69/SSH2.xs#L1410):

SSH2_CHANNEL*
net_ss_channel(SSH2* ss,  SSH2_CHARP_OR_NULL channel_type = NULL,   \
               int window_size = LIBSSH2_CHANNEL_WINDOW_DEFAULT,    \
               int packet_size = LIBSSH2_CHANNEL_PACKET_DEFAULT)
PREINIT:
    static const char mandatory_type[] = "session";
CODE:
    if (channel_type && strcmp(channel_type, mandatory_type))
        Perl_croak(aTHX_ "channel_type must be 'session' ('%s' given)", channel_type);
    NEW_CHANNEL(libssh2_channel_open_ex(ss->session,
        mandatory_type, strlen(mandatory_type), window_size, packet_size,
        NULL/*message*/, 0/*message_len*/));
OUTPUT:
    RETVAL

And NEW_CHANNEL is a 'nice' multi-line macro:

/* general wrapper */
#define NEW_ITEM(type, field, create, parent) do { \
    Newz(0/*id*/, RETVAL, 1, type); \
    if (RETVAL) { \
        RETVAL->parent = parent; \
        RETVAL->sv_##parent = SvREFCNT_inc(SvRV(ST(0))); \
        RETVAL->field = create; \
        debug(#create " -> 0x%p\n", RETVAL->field); \
    } \
    if (!RETVAL || !RETVAL->field) { \
        if (RETVAL) \
            SvREFCNT_dec(RETVAL->sv_##parent); \
        Safefree(RETVAL); \
        XSRETURN_EMPTY; \
    } \
} while(0)

/* wrap a libSSH2 channel */
#define NEW_CHANNEL(create) NEW_ITEM(SSH2_CHANNEL, channel, create, ss)

I'm still not sure under which conditions it would return NULL/undef.

#11 Updated by michalnowak about 1 year ago

Something I noted in the original problem description:

I noticed that the issue exhibits only when the console/integration_services test is scheduled after the VM is restarted (due to installation finished), when I schedule the test soon after installation/bootloader test the issue does not happen. Console reset() in action?

Can it be that some object related to this gets destroyed by action on VM restart?

#12 Updated by okurz about 1 year ago

This is an autogenerated message for openQA integration by the openqa_review script:

This bug is still referenced in a failing openQA test: textmode@svirt-hyperv2012r2-uefi
https://openqa.suse.de/tests/2407538

#13 Updated by mkittler about 1 year ago

  • Assignee set to mkittler

Can it be that some object related to this gets destroyed by action on VM restart?

Might be.

Maybe I can also reproduce this locally using qemu (via svirt). The affected code doesn't seem to be hyper-v specific at least.

#14 Updated by mkittler about 1 year ago

  • Target version changed from Ready to Current Sprint

Locally with svirt/qemu the test only runs until prepare_test_data because of a download issue. But it seems that the relevant module isn't even scheduled anyways for svirt/qemu (according to loadtest 'console/integration_services' if is_hyperv || is_vmware; in main.pm).

So I tried svirt/hyperv locally with openqaw7-hyperv.qa.suse.de but the relevant module integration_services passed. It seems like the issue doesn't occur always (which makes it even harder to track). Also on OSD we sometimes have tests which pass (e.g. https://openqa.suse.de/tests/2430299).

Maybe the issue occurs when there's a connection issue like a timeout. But I don't think it has something to us restarting the VM. At least I would expect it to fail always then. Or do we restart the VM only sometimes?

We could prevent the issue by simply testing for undef before and try to re-establish the connection. @michalnowak What do you think?

#15 Updated by michalnowak about 1 year ago

mkittler wrote:

So I tried svirt/hyperv locally with openqaw7-hyperv.qa.suse.de but the relevant module integration_services passed. It seems like the issue doesn't occur always (which makes it even harder to track). Also on OSD we sometimes have tests which pass (e.g. https://openqa.suse.de/tests/2430299).

I think you were exceptionally lucky. Failed twice yesterday: https://openqa.suse.de/tests/overview?distri=sle&version=15-SP1&build=161.1&groupid=115.

Maybe the issue occurs when there's a connection issue like a timeout. But I don't think it has something to us restarting the VM. At least I would expect it to fail always then. Or do we restart the VM only sometimes?

Nope, always restarts. It's like 90 % failure rate. I don't think it's timeout as serial console (which is SSH connection as well) works https://openqa.suse.de/tests/2450742#step/integration_services/6 right after the failure.

We could prevent the issue by simply testing for undef before and try to re-establish the connection. @michalnowak What do you think?

Feel free to.

#16 Updated by mkittler about 1 year ago

I tried again locally and the module passes again. I give it another try to reproduce.

Feel free to.

If I can't reproduce and therefore can't find the source of the problem I suppose that's the only option we've got.

#17 Updated by mkittler about 1 year ago

Locally 5 of 5 runs passed. Maybe I'm really lucks but somehow I doubt that.

From the documentation it looks that the ->channel() method is actually supposed to create a new channel already. According to my previous investigation it isn't clear to me in which cases that would fail and return undef instead. Maybe we could workaround the issue by constructing a completely new Net::SSH2 object.

#18 Updated by michalnowak about 1 year ago

May I see such a passed job?

What's the OS of the worker? (That seems to be the difference in here.) OSD are, I think, some sort of 42.3/12 SP3, my worker is Leap 15.0.

#19 Updated by mkittler about 1 year ago

I'm currently only in VPN. Right now it would be accessible via this IP: http://10.163.2.112/tests/latest?distri=sle&test=textmode&flavor=Installer-DVD&arch=x86_64&machine=svirt-hyperv2012r2&version=15-SP1#step/integration_services/8

You can use Next & Previous tab to see the other jobs. But they don't look different. Everything (except the Hyper-V host of course) run on Tumbleweed.

#20 Updated by mkittler about 1 year ago

That seems to be the difference in here.

And most likely the version of the perl-Net-SSH2 package is relevant here. On openqaworker2 it is still 0.62. On Tumbleweed I'm already at 0.69. Maybe that makes a difference although I can not find anything obvious when reading the diff between the versions.

#21 Updated by michalnowak about 1 year ago

With this change it passed 3/3 on VMware, where it otherwise nearly always fails for me:

diff --git a/lib/utils.pm b/lib/utils.pm
index b96ca1820..bc70b32df 100644
--- a/lib/utils.pm
+++ b/lib/utils.pm
@@ -154,6 +154,8 @@ Check that guest IP address that host and guest see is the same.
 =cut
 sub integration_services_check_ip {
     # Host-side of Integration Services
+    select_console('svirt');
+    select_console('sut');
     my $vmname = console('svirt')->name;
     my $ips_host_pov;
     if (check_var('VIRSH_VMM_FAMILY', 'hyperv')) {

I guess the SSH object is gone after reset_consoles from power_action(). Maybe reset_consoles might not reset 'svirt' console?

#22 Updated by michalnowak about 1 year ago

Merged the workaround https://github.com/os-autoinst/os-autoinst-distri-opensuse/pull/6794. It was 100% effective in my testing. Will see how it goes in OSD.

#23 Updated by mkittler about 1 year ago

Since your worker uses the same version of the perl-Net-SSH2 package that shouldn't make a difference. Maybe the Hyper-V host I've been using makes a difference compared to production.

If a workaround can be applied that simple in the test distro I'd say it is not worth adding yet another workaround in os-autoinst itself (like the idea to create a new Net::SSH2 object which would likely be the same than your select_console workaround).

#24 Updated by michalnowak about 1 year ago

mkittler wrote:

Since your worker uses the same version of the perl-Net-SSH2 package that shouldn't make a difference. Maybe the Hyper-V host I've been using makes a difference compared to production.

From what I've seen at your openQA instance (linked in https://progress.opensuse.org/issues/44771#note-19, inaccessible for me now), you used the very same Hyper-V 2012 R2 host used in production. One thing to mention, that host is in Prague server room, rest of the infra in Nue (but I had same problem with VMware host in case all the infra is in Nue, hopefully in the same VLAN as well).

If a workaround can be applied that simple in the test distro I'd say it is not worth adding yet another workaround in os-autoinst itself (like the idea to create a new Net::SSH2 object which would likely be the same than your select_console workaround).

Agreed. I don't think it's that useful.

Excluding 'svirt' console from reset_consoles might do the trick, omitting select_console juggling in test code, but there might be consequences to that I am not aware of.

#25 Updated by okurz about 1 year ago

michalnowak wrote:

Excluding 'svirt' console from reset_consoles might do the trick

If that is not the case I suggest to do that, exclude 'svirt' from the consoles to be resetted. Same as should be done for e.g. the s3270 one in case of s390x z/VM as we never expect that console to vanish during test execution regardless of the state of the VM.

#26 Updated by michalnowak about 1 year ago

Excluding 'svirt' and 'sut' consoles from reset does not help. With some diag code it seems that they are not being reset in the first place. Will keep the workaround in test code in place.

#27 Updated by mkittler about 1 year ago

That PR will change the relevant code a little bit (but shouldn't alter the behavior): https://github.com/os-autoinst/os-autoinst/pull/1113

#28 Updated by mkittler 12 months ago

  • Status changed from New to In Progress

I mentioned in the previous comments that the documentation lacks information about error handling. However, this is actually covered in SYNOPSIS of the channel object (instead of the description of the ->channel() method itself):

my $chan = $ssh2->channel()
  or $ssh2->die_with_error;

$chan->exec("ls -ld /usr/local/libssh2*")
  or $ssh2->die_with_error;

$chan->send_eof;

So we should handle ->channel() being undef in our code as well. And the error handling for ->exec() should be improved as well. However, in our case it makes sense not do die instantly but to re-establish the SSH connection. I suppose that's also what your workaround does because the select_console likely executes consoles::sshVirtsh::activate which establishes a new SSH connection, right?

By the way, the error handling of get_cmd_output should be improved in general. For instance if one passes an invalid domain we would right now get the not so nice error message Can't call method "channel" on an undefined value.

michalnowak wrote:

Excluding 'svirt' and 'sut' consoles from reset does not help.

Judging by the os-autoinst code that is no surprise. It looks like resetting a console only sets the activated flag to zero again. But it would not call Net::SSH2->disconnect or do something else which would prevent creation of further channels.


So I'm going to improve the error handling of get_cmd_output in the way I've described.

#29 Updated by mkittler 12 months ago

PR: https://github.com/os-autoinst/os-autoinst/pull/1126 - maybe someone wants to help me testing?

#30 Updated by mkittler 11 months ago

  • Status changed from In Progress to Feedback

The PR has been merged. So the error handling has been improved in general and we should get a more useful error message. Besides, to fix the particular problem reported here the PR implements an SSH reconnect on the os-autoinst side if the channel creation fails. This should make the workaround within the test distribution no longer necessary.

#31 Updated by mkittler 11 months ago

  • Status changed from Feedback to Resolved

I guess the error is gone. (Actually hard to tell because @michalnowak's workaround is likely still in place.)

Also available in: Atom PDF