action #81878
[jeos] Failed to create ssh channel to vmware host in integration_services module
Status: closed
Description
Observation
openQA test in scenario sle-15-SP3-JeOS-for-VMware-x86_64-jeos-filesystem_xenhvm@svirt-vmware65 fails in integration_services.
The test fails to retrieve data about the running VM from the host:
Unable to create SSH channel for executing "set -x; vmid=$(vim-cmd vmsvc/getallvms | awk '/openQA-SUT-1/ { print $1 }');if [ $vmid ]; then vim-cmd vmsvc/get.guest $vmid | awk '/ipAddress/ {print $3}' | head -n1 | sed -e 's/"//g' | sed -e 's/,//g'; fi": no libssh2 error registered at /usr/lib/os-autoinst/backend/baseclass.pm line 1416.
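For readers following the failing command, here is a minimal sketch of what its two filter stages extract, run against made-up sample text instead of a real ESXi host. The sample_vms and sample_guest strings and the function names are assumptions for illustration only; real vim-cmd output may be laid out differently.

```shell
# Hypothetical sample text, NOT captured from a real host.
sample_vms='Vmid   Name           File
12     openQA-SUT-1   [datastore1] openQA/openQA-SUT-1.vmx'

sample_guest='   hostName = "sut",
   ipAddress = "10.0.0.5",
   ipAddress = "10.0.0.6",'

# Same awk filter as in the failing command: first column of the matching row.
extract_vmid() {
    awk '/openQA-SUT-1/ { print $1 }'
}

# Same filter chain as in the failing command: third field of the first
# ipAddress line, with double quotes and commas stripped.
extract_ip() {
    awk '/ipAddress/ {print $3}' | head -n1 | sed -e 's/"//g' | sed -e 's/,//g'
}

printf '%s\n' "$sample_vms" | extract_vmid
printf '%s\n' "$sample_guest" | extract_ip
```

The filters themselves are plain text processing, so the reported error comes from opening the SSH channel and not from the pipeline.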
Reproducible
Fails since (at least) Build 20.139
Expected result
Last good: 20.100 (or more recent)
Further details
Always latest result in this scenario: latest
Updated by mloviska almost 4 years ago
- Related to action #44771: [tools][svirt] Can't call method "exec" on an undefined value added
Updated by tjyrinki_suse almost 4 years ago
- Subject changed from Failed to create ssh channel to vmware host in integration_services module to [jeos] Failed to create ssh channel to vmware host in integration_services module
- Start date deleted (2021-01-08)
Updated by okurz almost 4 years ago
This is an autogenerated message for openQA integration by the openqa_review script:
This bug is still referenced in a failing openQA test: jeos-filesystem_xenhvm@svirt-vmware65
https://openqa.suse.de/tests/5423352
To prevent further reminder comments one of the following options should be followed:
- The test scenario is fixed by applying the bug fix to the tested product or the test is adjusted
- The openQA job group is moved to "Released"
- The label in the openQA scenario is removed
Updated by nanzhang almost 4 years ago
The same issue can be reproduced in sle-15-SP3-Online-x86_64-Build136.1-textmode_svirt@svirt-vmware65.
http://10.67.129.66/tests/38#step/integration_services/8
Please treat this issue as a priority, as we need to enable the related tests for 3rd party hypervisors in OSD as per the VT QA roadmap.
Updated by ph03nix almost 4 years ago
- Status changed from New to In Progress
- Assignee set to ph03nix
Updated by ph03nix almost 4 years ago
This part only fails if we also run the consoletest_setup test.
Passing without consoletest_setup: http://duck-norris.qam.suse.de/tests/5069#
Failing with consoletest_setup: http://duck-norris.qam.suse.de/tests/5070#step/integration_services/5
Updated by ybonatakis almost 4 years ago
- Assignee changed from ph03nix to ybonatakis
Updated by ybonatakis almost 4 years ago
https://github.com/os-autoinst/os-autoinst-distri-opensuse/pull/12008
The above PR is a small workaround for the issue.
Moving the integration_services module earlier in the schedule seems to fix the problem.
From my observations, the module appears to reuse the existing ssh connection but, for a reason that is not obvious from the logs, fails to create a new channel ($ssh->channel from Net::SSH2). According to Marius, this might be because "it depends on the state of the already opened connection whether opening a further channel is possible at this point."
One of my first thoughts was to replace Net::SSH2 with Net::OpenSSH, which I consider more stable. However, this needs changes in os-autoinst for starters.
Another option might be to create a new ssh connection. Unfortunately, I am not sure how to do that at first glance.
I decided to play around with the options that are passed into run_ssh_cmd, which is where the failure appears.
[2021-02-22T18:30:15.968 CET] [debug] <<< backend::baseclass::run_ssh_cmd(cmd="virsh -c esx://root\@openqaw8-vmware.qa.suse.de/?no_verify=1\\&authfile=/tmp/l657F9fdDQ undefine --snapshots-metadata openQA-SUT-15", wantarray=0, keep_open=1)
So I modified the chain of fallback subroutines to use keep_open=0. This didn't seem to have any effect.
In addition, authfile does not look like a valid parameter, which I can confirm from the documentation (https://libvirt.org/drvesx.html#extraparams).
This constantly raises a warning:
2021-02-11 16:06:53.939+0000: 4723: warning : esxUtil_ParseUri:149 : Ignoring unexpected query parameter 'authfile'
However, I am not sure how related this is to the issue.
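To make the authfile observation reproducible without a libvirt host, here is a minimal sketch that lists query parameters of a connection URI that are not among the ESX driver's documented extra parameters. The known_params list is an assumption copied from the documentation page linked above and may differ between libvirt versions. The function name is hypothetical, and the URI is the one from the debug log with the shell escaping removed.

```shell
# Assumption: extra parameters documented for the libvirt ESX driver;
# verify against the documentation for the libvirt version in use.
known_params="transport vcenter no_verify auto_answer proxy"

# Print the name of every query parameter in a URI that is not in known_params.
unknown_uri_params() {
    uri=$1
    case $uri in
        *\?*) query=${uri#*\?} ;;   # keep only the part after the first "?"
        *) return 0 ;;              # no query string, nothing to report
    esac
    printf '%s\n' "$query" | tr '&' '\n' | while IFS='=' read -r name _value; do
        case " $known_params " in
            *" $name "*) ;;                  # documented parameter, skip
            *) printf '%s\n' "$name" ;;      # undocumented parameter
        esac
    done
}

unknown_uri_params 'esx://root@openqaw8-vmware.qa.suse.de/?no_verify=1&authfile=/tmp/l657F9fdDQ'
```

For the URI above this prints authfile, which matches the parameter named in the esxUtil_ParseUri warning.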
Note for me: http://aquarius.suse.cz/tests/4972/file/autoinst-log.txt (different error?)
Updated by jlausuch almost 4 years ago
- Project changed from openQA Tests (public) to 208
- Category deleted (Bugs in existing tests)
Updated by mkittler over 3 years ago
I've been creating a link for Net::SSH2 in our devel repo: osc linkpac openSUSE:Factory perl-Net-SSH2 devel:openQA:Leap:15.2
So the production workers will be updated to the most recent version on the next deployment. I've also just updated the package on openqaworker2 manually.
Updated by ybonatakis over 3 years ago
mkittler wrote:
I've been creating a link for Net::SSH2 in our devel repo: osc linkpac openSUSE:Factory perl-Net-SSH2 devel:openQA:Leap:15.2
So the production workers will be updated to the most recent version on the next deployment. I've also just updated the package on openqaworker2 manually.
For the record, the version update is from 0.69 to 0.72.
Updated by ybonatakis over 3 years ago
The module fails when it tries to create another channel for the selected ssh connection. What we see is:
Net::SSH2=SCALAR(0x55aea0041f00)
libssh2_channel_open_ex(ss->session, mandatory_type, strlen(mandatory_type), window_size, packet_size, ((void *)0) , 0 ) -> 0x0
When this happens, the ssh service logs a record that points to the root of the error:
Apr 10 09:18:29 openqaw5-xen sshd[1998]: Accepted keyboard-interactive/pam for root from 10.100.224.110 port 38502 ssh2
Apr 10 09:18:48 openqaw5-xen sshd[2038]: Accepted keyboard-interactive/pam for root from 10.100.224.110 port 38504 ssh2
Apr 10 09:22:30 openqaw5-xen sshd[2102]: error: kex_exchange_identification: Connection closed by remote host
Looking at the source code of openssh [0], we can see that the error shows up when the error code equals EPIPE. The definition of that specific code [1] reads:
Macro: int EPIPE
“Broken pipe.” There is no process reading from the other end of a pipe. Every library function that returns this error code also generates a SIGPIPE signal; this signal terminates the program if not handled or blocked. Thus, your program will never actually see EPIPE unless it has handled or blocked SIGPIPE.
So something between the server and the client is interrupted during the kex_exchange_identification function of openssh, which exchanges the identification strings between them.
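To spot further occurrences of this failure in an sshd log, a minimal sketch along these lines could be used. The log excerpt is copied from the lines above. The function name is hypothetical, and the field positions assume the syslog layout shown in that excerpt.

```shell
# Log excerpt copied from this comment.
sshd_log='Apr 10 09:18:29 openqaw5-xen sshd[1998]: Accepted keyboard-interactive/pam for root from 10.100.224.110 port 38502 ssh2
Apr 10 09:18:48 openqaw5-xen sshd[2038]: Accepted keyboard-interactive/pam for root from 10.100.224.110 port 38504 ssh2
Apr 10 09:22:30 openqaw5-xen sshd[2102]: error: kex_exchange_identification: Connection closed by remote host'

# Print "time pid" for every connection dropped during the identification
# exchange. Assumes field 3 is the time and field 5 is "sshd[PID]:".
kex_failures() {
    awk '/kex_exchange_identification/ {
        pid = $5
        gsub(/[^0-9]/, "", pid)   # keep only the digits of "sshd[PID]:"
        print $3, pid
    }'
}

printf '%s\n' "$sshd_log" | kex_failures
```

Comparing the printed timestamps with the time of the failed channel creation in autoinst-log.txt may help confirm whether the two events belong together.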
I tried to capture some traffic for analysis, but my skills are limited here. Ask me for the pcap if you can take a look (you might need to create a new one, however, and retrieve the keys for decrypting the data; I can't upload the file because it exceeds the size limit (>20M)).
Given the above finding, the only thing I can think of for now is to schedule the module earlier for jeos, which seems to work in the majority of the VRs, or to add a softfail (??).
[0] https://github.com/openssh/openssh-portable/blob/V_8_2/kex.c#L1247
[1] https://www.gnu.org/software/libc/manual/html_node/Error-Codes.html
Updated by cfconrad over 3 years ago
I wasn't able to fix it in os-autoinst-distri-opensuse, but either of the following gives us a solution:
https://github.com/os-autoinst/os-autoinst/pull/1643 (just an improvement, to make run_cmd() more configurable)
https://github.com/os-autoinst/os-autoinst/pull/1642 (preferred)
Updated by ybonatakis over 3 years ago
- Status changed from In Progress to Feedback
New PR based on Clemens' changes: https://github.com/os-autoinst/os-autoinst-distri-opensuse/pull/12332
Updated by nanzhang over 3 years ago
The latest run passed on build 176.1 (snapshot15 candidate) without the ssh issue; we may need one more build to verify stability.
Updated by cfconrad over 3 years ago
With https://github.com/os-autoinst/os-autoinst/pull/1660, we are now also able to call console('svirt')->run_cmd($cmd, wantarray => 1, keep_open => 0) directly.
Updated by ph03nix about 1 month ago
- Project changed from 208 to Containers and images