action #44771
closed[tools][svirt] Can't call method "exec" on an undefined value
Added by michalnowak almost 6 years ago. Updated over 5 years ago.
0%
Description
Many svirt Hyper-V (and VMware I run locally) jobs with console/integration_services
test fail with:
Can't call method "exec" on an undefined value at /usr/lib/os-autoinst/consoles/sshVirtsh.pm line 583.
First such a failure on OSD is https://openqa.suse.de/tests/2277811#step/integration_services/6 from 23 Nov 2018.
The problem started to be more apparent on OSD since Dec 1st 19:00 UTC. Tests which started before that did not exhibit the issue (https://openqa.suse.de/tests/2296473#step/integration_services/16), those started after, do. (I don't know why it's so, maybe it's not relevant.)
The more substantial change in svirt between 20181115 and 20181129 deployments is https://github.com/os-autoinst/os-autoinst/pull/1048/.
The sub in question is https://github.com/os-autoinst/os-autoinst/blob/master/consoles/sshVirtsh.pm#L576.
I noticed that the issue exhibits only when the console/integration_services
test is scheduled after the VM is restarted (due to installation finished), when I schedule the test soon after installation/bootloader
test the issue does not happen. Console reset()
in action?
Updated by okurz almost 6 years ago
hm, @michalnowak you used the tag "[svirt]" but no further "team"-related tag in the subject line. Do you plan to pick it up yourself eventually or expect a team to track it?
Updated by michalnowak almost 6 years ago
I expect the team to take it as I ran out of ideas on an obvious fix.
Updated by mkittler almost 6 years ago
Do you already attempt to use the new serial console introduced in my PR? (Seems like that's not the case.)
So only Hyper-V and VMWare is affected and not qemu? So far I only used qemu to run tests locally. How would I reproduce this locally?
Console
reset()
in action?
Some debug printing couuld answer that question.
os-autoinst log around https://openqa.suse.de/tests/2296473#step/integration_services/16:
[2018-12-01T19:12:10.960 CET] [debug] <<< testapi::wait_serial(timeout=90, regexp='SCRIPT_FINISHEDGKlfe-\\d+-', record_output=1)
GKlfe
10.162.30.962620:113:80c0:80a0:10:162:29:9da9SCRIPT_FINISHEDGKlfe-0-
[2018-12-01T19:12:13.072 CET] [debug] >>> testapi::wait_serial: SCRIPT_FINISHEDGKlfe-\d+-: ok
[2018-12-01T19:12:13.073 CET] [debug] /var/lib/openqa/cache/openqa.suse.de/tests/sle/tests/console/integration_services.pm:21 called utils::integration_services_check
[2018-12-01T19:12:13.073 CET] [debug] <<< testapi::record_info(title='IP (host)', output=undef, result='ok')
Use of uninitialized value $output in print at /usr/lib/os-autoinst/basetest.pm line 433.
[2018-12-01T19:12:13.073 CET] [debug] /var/lib/openqa/cache/openqa.suse.de/tests/sle/tests/console/integration_services.pm:21 called utils::integration_services_check
[2018-12-01T19:12:13.073 CET] [debug] <<< testapi::record_info(title='IP (guest)', output='10.162.30.962620:113:80c0:80a0:10:162:29:9da9', result='ok')
Use of uninitialized value $ips_host_pov in string ne at /var/lib/openqa/cache/openqa.suse.de/tests/sle/lib/utils.pm line 171.
Use of uninitialized value $ips_host_pov in concatenation (.) or string at /var/lib/openqa/cache/openqa.suse.de/tests/sle/lib/utils.pm line 171.
[2018-12-01T19:12:13.188 CET] [debug] ips_host_pov=<> ips_guest_pov=<10.162.30.962620:113:80c0:80a0:10:162:29:9da9> at /var/lib/openqa/cache/openqa.suse.de/tests/sle/lib/utils.pm line 171.
[2018-12-01T19:12:13.188 CET] [debug] <<< testapi::select_console(testapi_console='log-console')
[2018-12-01T19:12:13.399 CET] [debug] activate_console, console: log-console, type: console
[2018-12-01T19:12:13.399 CET] [debug] <<< testapi::check_screen(mustmatch='any-console', timeout=10)
[2018-12-01T19:12:13.568 CET] [debug] >>> testapi::_handle_found_needle: found any-console-20180914, similarity 1.00 @ 55/754
[2018-12-01T19:12:13.568 CET] [debug] <<< testapi::assert_screen(mustmatch=[
'tty5-selected',
'text-logged-in-root'
], timeout=60)
[2018-12-01T19:12:14.652 CET] [debug] >>> testapi::_handle_found_needle: found tty5-selected-20180511, similarity 1.00 @ 47/113
Updated by michalnowak almost 6 years ago
- Related to action #44840: [hyperv] Use of uninitialized value in lib/utils integration_services_check_ip() added
Updated by michalnowak almost 6 years ago
mkittler wrote:
Do you already attempt to use the new serial console introduced in my PR? (Seems like that's not the case.)
Nope.
So only Hyper-V and VMWare is affected and not qemu? So far I only used qemu to run tests locally. How would I reproduce this locally?
qemu won't get to the code-path, it's related only to svirt backend. I think you can execute the code-path on KVM@svirt or Xen@svirt though I haven't seen the problem there as integration_services_check_ip()
is not executed in this context. Just run console('svirt')->get_cmd_output()
after installation, I guess? Note that it's not 100% reproducible. If it's not reproducible on KVM or Xen at all, I can help you set up Hyper-V@svirt.
os-autoinst log around https://openqa.suse.de/tests/2296473#step/integration_services/16:
I think it comes out from the 'exec' problem , but shouldn't be related to the original issue.
Updated by okurz almost 6 years ago
- Subject changed from [svirt] Can't call method "exec" on an undefined value to [tools][svirt] Can't call method "exec" on an undefined value
"tools"-team then I think?
Updated by michalnowak almost 6 years ago
I'd like this to be investigated by the tools team as it failed six times - majority of cases - in Virtualization groups (the Beta1 build - 125.1) and affects validation.
Updated by coolo almost 6 years ago
- Target version set to Ready
We should have enough staffing for that in June.
Updated by okurz almost 6 years ago
This is an autogenerated message for openQA integration by the openqa_review script:
This bug is still referenced in a failing openQA test: textmode@svirt-hyperv2012r2-uefi
https://openqa.suse.de/tests/2341720
Updated by mkittler almost 6 years ago
The relevant line is now 643:
sub get_cmd_output {
my ($self, $cmd, $args) = @_;
my $wantarray = $args->{wantarray};
my $domain = $args->{domain} // 'ssh';
my $chan = $self->{$domain}->channel();
-> $chan->exec($cmd);
bmwqemu::diag "Command executed: $cmd";
my @cmd_output = get_ssh_output($chan);
$chan->send_eof;
$chan->close();
return $wantarray ? \@cmd_output : $cmd_output[0];
}
Obviously the Net::SSH2
object exists but the ->channel()
method returned undef
instead of a Net::SSH2::Channel
object. So "Can't call method "exec" on an undefined value" actually tells us that we could not create an SSH channel. From the documentation of Net::SSH2
it is hard to tell whether returning undef
is intended and how one would obtain further information about the error.
The ->channel()
method is implemented in C (see https://metacpan.org/source/SALVA/Net-SSH2-0.69/SSH2.xs#L1410):
SSH2_CHANNEL*
net_ss_channel(SSH2* ss, SSH2_CHARP_OR_NULL channel_type = NULL, \
int window_size = LIBSSH2_CHANNEL_WINDOW_DEFAULT, \
int packet_size = LIBSSH2_CHANNEL_PACKET_DEFAULT)
PREINIT:
static const char mandatory_type[] = "session";
CODE:
if (channel_type && strcmp(channel_type, mandatory_type))
Perl_croak(aTHX_ "channel_type must be 'session' ('%s' given)", channel_type);
NEW_CHANNEL(libssh2_channel_open_ex(ss->session,
mandatory_type, strlen(mandatory_type), window_size, packet_size,
NULL/*message*/, 0/*message_len*/));
OUTPUT:
RETVAL
And NEW_CHANNEL
is a 'nice' multi-line macro:
/* general wrapper */
#define NEW_ITEM(type, field, create, parent) do { \
Newz(0/*id*/, RETVAL, 1, type); \
if (RETVAL) { \
RETVAL->parent = parent; \
RETVAL->sv_##parent = SvREFCNT_inc(SvRV(ST(0))); \
RETVAL->field = create; \
debug(#create " -> 0x%p\n", RETVAL->field); \
} \
if (!RETVAL || !RETVAL->field) { \
if (RETVAL) \
SvREFCNT_dec(RETVAL->sv_##parent); \
Safefree(RETVAL); \
XSRETURN_EMPTY; \
} \
} while(0)
/* wrap a libSSH2 channel */
#define NEW_CHANNEL(create) NEW_ITEM(SSH2_CHANNEL, channel, create, ss)
I'm still not sure under which conditions it would return NULL
/undef
.
Updated by michalnowak almost 6 years ago
Something I noted in the original problem description:
I noticed that the issue exhibits only when the console/integration_services test is scheduled after the VM is restarted (due to installation finished), when I schedule the test soon after installation/bootloader test the issue does not happen. Console reset() in action?
Can it be that some object related to this gets destroyed by action on VM restart?
Updated by okurz almost 6 years ago
This is an autogenerated message for openQA integration by the openqa_review script:
This bug is still referenced in a failing openQA test: textmode@svirt-hyperv2012r2-uefi
https://openqa.suse.de/tests/2407538
Updated by mkittler almost 6 years ago
- Assignee set to mkittler
Can it be that some object related to this gets destroyed by action on VM restart?
Might be.
Maybe I can also reproduce this locally using qemu (via svirt). The affected code doesn't seem to be hyper-v specific at least.
Updated by mkittler almost 6 years ago
- Target version changed from Ready to Current Sprint
Locally with svirt/qemu the test only runs until prepare_test_data
because of a download issue. But it seems that the relevant module isn't even scheduled anyways for svirt/qemu (according to loadtest 'console/integration_services' if is_hyperv || is_vmware;
in main.pm
).
So I tried svirt/hyperv locally with openqaw7-hyperv.qa.suse.de
but the relevant module integration_services
passed. It seems like the issue doesn't occur always (which makes it even harder to track). Also on OSD we sometimes have tests which pass (e.g. https://openqa.suse.de/tests/2430299).
Maybe the issue occurs when there's a connection issue like a timeout. But I don't think it has something to us restarting the VM. At least I would expect it to fail always then. Or do we restart the VM only sometimes?
We could prevent the issue by simply testing for undef
before and try to re-establish the connection. @michalnowak What do you think?
Updated by michalnowak almost 6 years ago
mkittler wrote:
So I tried svirt/hyperv locally with
openqaw7-hyperv.qa.suse.de
but the relevant moduleintegration_services
passed. It seems like the issue doesn't occur always (which makes it even harder to track). Also on OSD we sometimes have tests which pass (e.g. https://openqa.suse.de/tests/2430299).
I think you were exceptionally lucky. Failed twice yesterday: https://openqa.suse.de/tests/overview?distri=sle&version=15-SP1&build=161.1&groupid=115.
Maybe the issue occurs when there's a connection issue like a timeout. But I don't think it has something to us restarting the VM. At least I would expect it to fail always then. Or do we restart the VM only sometimes?
Nope, always restarts. It's like 90 % failure rate. I don't think it's timeout as serial console (which is SSH connection as well) works https://openqa.suse.de/tests/2450742#step/integration_services/6 right after the failure.
We could prevent the issue by simply testing for
undef
before and try to re-establish the connection. @michalnowak What do you think?
Feel free to.
Updated by mkittler almost 6 years ago
I tried again locally and the module passes again. I give it another try to reproduce.
Feel free to.
If I can't reproduce and therefore can't find the source of the problem I suppose that's the only option we've got.
Updated by mkittler almost 6 years ago
Locally 5 of 5 runs passed. Maybe I'm really lucks but somehow I doubt that.
From the documentation it looks that the ->channel()
method is actually supposed to create a new channel already. According to my previous investigation it isn't clear to me in which cases that would fail and return undef
instead. Maybe we could workaround the issue by constructing a completely new Net::SSH2
object.
Updated by michalnowak almost 6 years ago
May I see such a passed job?
What's the OS of the worker? (That seems to be the difference in here.) OSD are, I think, some sort of 42.3/12 SP3, my worker is Leap 15.0.
Updated by mkittler almost 6 years ago
I'm currently only in VPN. Right now it would be accessible via this IP: http://10.163.2.112/tests/latest?distri=sle&test=textmode&flavor=Installer-DVD&arch=x86_64&machine=svirt-hyperv2012r2&version=15-SP1#step/integration_services/8
You can use Next & Previous tab to see the other jobs. But they don't look different. Everything (except the Hyper-V host of course) run on Tumbleweed.
Updated by mkittler almost 6 years ago
That seems to be the difference in here.
And most likely the version of the perl-Net-SSH2 package is relevant here. On openqaworker2 it is still 0.62. On Tumbleweed I'm already at 0.69. Maybe that makes a difference although I can not find anything obvious when reading the diff between the versions.
Updated by michalnowak almost 6 years ago
With this change it passed 3/3 on VMware, where it otherwise nearly always fails for me:
diff --git a/lib/utils.pm b/lib/utils.pm
index b96ca1820..bc70b32df 100644
--- a/lib/utils.pm
+++ b/lib/utils.pm
@@ -154,6 +154,8 @@ Check that guest IP address that host and guest see is the same.
=cut
sub integration_services_check_ip {
# Host-side of Integration Services
+ select_console('svirt');
+ select_console('sut');
my $vmname = console('svirt')->name;
my $ips_host_pov;
if (check_var('VIRSH_VMM_FAMILY', 'hyperv')) {
I guess the SSH object is gone after reset_consoles
from power_action()
. Maybe reset_consoles
might not reset 'svirt' console?
Updated by michalnowak almost 6 years ago
Merged the workaround https://github.com/os-autoinst/os-autoinst-distri-opensuse/pull/6794. It was 100% effective in my testing. Will see how it goes in OSD.
Updated by mkittler almost 6 years ago
Since your worker uses the same version of the perl-Net-SSH2 package that shouldn't make a difference. Maybe the Hyper-V host I've been using makes a difference compared to production.
If a workaround can be applied that simple in the test distro I'd say it is not worth adding yet another workaround in os-autoinst itself (like the idea to create a new Net::SSH2
object which would likely be the same than your select_console
workaround).
Updated by michalnowak almost 6 years ago
mkittler wrote:
Since your worker uses the same version of the perl-Net-SSH2 package that shouldn't make a difference. Maybe the Hyper-V host I've been using makes a difference compared to production.
From what I've seen at your openQA instance (linked in https://progress.opensuse.org/issues/44771#note-19, inaccessible for me now), you used the very same Hyper-V 2012 R2 host used in production. One thing to mention, that host is in Prague server room, rest of the infra in Nue (but I had same problem with VMware host in case all the infra is in Nue, hopefully in the same VLAN as well).
If a workaround can be applied that simple in the test distro I'd say it is not worth adding yet another workaround in os-autoinst itself (like the idea to create a new
Net::SSH2
object which would likely be the same than yourselect_console
workaround).
Agreed. I don't think it's that useful.
Excluding 'svirt' console from reset_consoles
might do the trick, omitting select_console
juggling in test code, but there might be consequences to that I am not aware of.
Updated by okurz almost 6 years ago
michalnowak wrote:
Excluding 'svirt' console from
reset_consoles
might do the trick
If that is not the case I suggest to do that, exclude 'svirt' from the consoles to be resetted. Same as should be done for e.g. the s3270 one in case of s390x z/VM as we never expect that console to vanish during test execution regardless of the state of the VM.
Updated by michalnowak almost 6 years ago
Excluding 'svirt' and 'sut' consoles from reset does not help. With some diag
code it seems that they are not being reset in the first place. Will keep the workaround in test code in place.
Updated by mkittler almost 6 years ago
That PR will change the relevant code a little bit (but shouldn't alter the behavior): https://github.com/os-autoinst/os-autoinst/pull/1113
Updated by mkittler over 5 years ago
- Status changed from New to In Progress
I mentioned in the previous comments that the documentation lacks information about error handling. However, this is actually covered in SYNOPSIS
of the channel object (instead of the description of the ->channel()
method itself):
my $chan = $ssh2->channel()
or $ssh2->die_with_error;
$chan->exec("ls -ld /usr/local/libssh2*")
or $ssh2->die_with_error;
$chan->send_eof;
So we should handle ->channel()
being undef
in our code as well. And the error handling for ->exec()
should be improved as well. However, in our case it makes sense not do die instantly but to re-establish the SSH connection. I suppose that's also what your workaround does because the select_console
likely executes consoles::sshVirtsh::activate
which establishes a new SSH connection, right?
By the way, the error handling of get_cmd_output
should be improved in general. For instance if one passes an invalid domain we would right now get the not so nice error message Can't call method "channel" on an undefined value
.
michalnowak wrote:
Excluding 'svirt' and 'sut' consoles from reset does not help.
Judging by the os-autoinst code that is no surprise. It looks like resetting a console only sets the activated flag to zero again. But it would not call Net::SSH2->disconnect
or do something else which would prevent creation of further channels.
So I'm going to improve the error handling of get_cmd_output
in the way I've described.
Updated by mkittler over 5 years ago
PR: https://github.com/os-autoinst/os-autoinst/pull/1126 - maybe someone wants to help me testing?
Updated by mkittler over 5 years ago
- Status changed from In Progress to Feedback
The PR has been merged. So the error handling has been improved in general and we should get a more useful error message. Besides, to fix the particular problem reported here the PR implements an SSH reconnect on the os-autoinst side if the channel creation fails. This should make the workaround within the test distribution no longer necessary.
Updated by mkittler over 5 years ago
- Status changed from Feedback to Resolved
I guess the error is gone. (Actually hard to tell because @michalnowak's workaround is likely still in place.)
Updated by mloviska almost 4 years ago
- Related to action #81878: [jeos] Failed to create ssh channel to vmware host in integration_services module added