action #81878

closed

[jeos] Failed to create ssh channel to vmware host in integration_services module

Added by mloviska about 4 years ago. Updated 3 months ago.

Status: Resolved
Priority: Normal
Assignee:
Target version: -
Start date:
Due date:
% Done: 0%
Estimated time:

Description

Observation

openQA test in scenario sle-15-SP3-JeOS-for-VMware-x86_64-jeos-filesystem_xenhvm@svirt-vmware65 fails in
integration_services

The test fails to retrieve data about the running VM on the host.

Unable to create SSH channel for executing "set -x; vmid=$(vim-cmd vmsvc/getallvms | awk '/openQA-SUT-1/ { print $1 }');if [ $vmid ]; then vim-cmd vmsvc/get.guest $vmid | awk '/ipAddress/ {print $3}' | head -n1 | sed -e 's/"//g' | sed -e 's/,//g'; fi": no libssh2 error registered at /usr/lib/os-autoinst/backend/baseclass.pm line 1416.
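
The filter chain in the failing command can be tried offline to see what each stage contributes. The following is a minimal sketch, assuming that `vim-cmd vmsvc/get.guest` prints lines of the form `ipAddress = "10.0.0.5",`; the sample text and the function names `sample_guest_info` and `extract_ip` are hypothetical and were not captured from a real host.

```shell
#!/bin/sh
# Hypothetical stand-in for `vim-cmd vmsvc/get.guest <vmid>`; the layout
# is an assumption made for illustration, not output from a real host.
sample_guest_info() {
cat <<'EOF'
   toolsStatus = "toolsOk",
   ipAddress = "10.0.0.5",
   ipAddress = "fe80::1",
EOF
}

# Same filter chain as in the failing command: take the third field of the
# first ipAddress line, then strip double quotes and commas.
extract_ip() {
    awk '/ipAddress/ {print $3}' | head -n1 | sed -e 's/"//g' | sed -e 's/,//g'
}

sample_guest_info | extract_ip
```

If the guest reports no `ipAddress` line at all, the pipeline prints nothing, so a caller has to treat empty output as "address not known yet".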

Reproducible

Fails since (at least) Build 20.139

Expected result

Last good: 20.100 (or more recent)

Further details

Always latest result in this scenario: latest


Related issues 1 (0 open, 1 closed)

Related to openQA Project (public) - action #44771: [tools][svirt] Can't call method "exec" on an undefined value (Resolved, mkittler, 2018-12-05)

Actions #1

Updated by mloviska about 4 years ago

  • Related to action #44771: [tools][svirt] Can't call method "exec" on an undefined value added
Actions #2

Updated by tjyrinki_suse almost 4 years ago

  • Subject changed from Failed to create ssh channel to vmware host in integration_services module to [jeos] Failed to create ssh channel to vmware host in integration_services module
  • Start date deleted (2021-01-08)
Actions #3

Updated by okurz almost 4 years ago

This is an autogenerated message for openQA integration by the openqa_review script:

This bug is still referenced in a failing openQA test: jeos-filesystem_xenhvm@svirt-vmware65
https://openqa.suse.de/tests/5423352

To prevent further reminder comments one of the following options should be followed:

  1. The test scenario is fixed by applying the bug fix to the tested product or the test is adjusted
  2. The openQA job group is moved to "Released"
  3. The label in the openQA scenario is removed
Actions #4

Updated by nanzhang almost 4 years ago

The same issue can be reproduced in sle-15-SP3-Online-x86_64-Build136.1-textmode_svirt@svirt-vmware65.
http://10.67.129.66/tests/38#step/integration_services/8

Please prioritize this issue, as we need to enable related tests for 3rd-party hypervisors in OSD as per the VT QA roadmap.

Actions #5

Updated by ph03nix almost 4 years ago

  • Status changed from New to In Progress
  • Assignee set to ph03nix
Actions #6

Updated by ph03nix almost 4 years ago

This part fails only if we run the consoletest_setup test.

Passing without consoletest_setup: http://duck-norris.qam.suse.de/tests/5069#
Failing with consoletest_setup: http://duck-norris.qam.suse.de/tests/5070#step/integration_services/5

Actions #7

Updated by ybonatakis almost 4 years ago

  • Assignee changed from ph03nix to ybonatakis
Actions #8

Updated by ybonatakis almost 4 years ago

https://github.com/os-autoinst/os-autoinst-distri-opensuse/pull/12008

The above PR is a small workaround for the issue.
Moving the integration_services module earlier in the schedule seems to fix the problem.

From my observations, the module appears to reuse the existing SSH connection but, for a reason that is not obvious from the logs, fails to create a new channel ($ssh->channel from Net::SSH2). According to Marius, this might be because "it depends on the state of the already opened connection whether opening a further channel is possible at this point."

One of my first thoughts was to replace Net::SSH2 with Net::OpenSSH, which I consider more stable. However, this needs changes in os-autoinst for starters.

Another option might be to create a new SSH connection. Unfortunately, I am not sure how to do that at first glance.
I decided to experiment with the options that are passed to run_ssh_cmd, which is where the failure appears.

[2021-02-22T18:30:15.968 CET] [debug] <<< backend::baseclass::run_ssh_cmd(cmd="virsh -c esx://root\@openqaw8-vmware.qa.suse.de/?no_verify=1\\&authfile=/tmp/l657F9fdDQ  undefine --snapshots-metadata openQA-SUT-15", wantarray=0, keep_open=1)

So I modified the chain of fallback subroutines to use keep_open=0. This didn't seem to have any effect.

In addition, authfile does not appear to be a valid parameter, which I can confirm from the documentation (https://libvirt.org/drvesx.html#extraparams).
This constantly raises a warning:

2021-02-11 16:06:53.939+0000: 4723: warning : esxUtil_ParseUri:149 : Ignoring unexpected query parameter 'authfile'

However, I am not sure how this is related to the issue.
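
To see which query parameters libvirt would ignore, the URI from the log can be split and compared against the extra parameters that the ESX driver documents. This is a minimal sketch, assuming the documented set is transport, vcenter, no_verify, auto_answer and proxy (worth verifying against the linked page); the function name `unknown_esx_params` is hypothetical, and the URI is the one from the log with the shell escaping removed.

```shell
#!/bin/sh
# Print every query parameter name in an esx:// URI that is not in the
# assumed list of extra parameters supported by the libvirt ESX driver.
unknown_esx_params() {
    uri=$1
    case $uri in
        *\?*) query=${uri#*\?} ;;   # keep only the part after the first '?'
        *)    return 0 ;;           # no query string, nothing to report
    esac
    # Split on '&', keep the name before '=', and drop the known names.
    printf '%s\n' "$query" | tr '&' '\n' | cut -d= -f1 |
        grep -v -x -E 'transport|vcenter|no_verify|auto_answer|proxy'
}

unknown_esx_params 'esx://root@openqaw8-vmware.qa.suse.de/?no_verify=1&authfile=/tmp/l657F9fdDQ'
```

For this URI the sketch reports only `authfile`, which agrees with the warning in the log. Note that the function returns a non-zero status when every parameter is known, because `grep -v` then selects no lines.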


Note for me: http://aquarius.suse.cz/tests/4972/file/autoinst-log.txt :different error?

Actions #9

Updated by mloviska almost 4 years ago

  • Tags set to qac, jeos
Actions #10

Updated by jlausuch almost 4 years ago

  • Project changed from openQA Tests (public) to 208
  • Category deleted (Bugs in existing tests)
Actions #11

Updated by mkittler almost 4 years ago

I've been creating a link for Net::SSH2 in our devel repo: osc linkpac openSUSE:Factory perl-Net-SSH2 devel:openQA:Leap:15.2

So the production workers will be updated to the most recent version on the next deployment. I've also just updated the package on openqaworker2 manually.

Actions #12

Updated by ybonatakis almost 4 years ago

mkittler wrote:

I've been creating a link for Net::SSH2 in our devel repo: osc linkpac openSUSE:Factory perl-Net-SSH2 devel:openQA:Leap:15.2

So the production workers will be updated to the most recent version on the next deployment. I've also just updated the package on openqaworker2 manually.

For the record, the version update is from 0.69 to 0.72.

Actions #13

Updated by ybonatakis almost 4 years ago

The module fails when it tries to create another channel for the selected ssh connection. What we see is

Net::SSH2=SCALAR(0x55aea0041f00)
libssh2_channel_open_ex(ss->session, mandatory_type, strlen(mandatory_type), window_size, packet_size, ((void *)0) , 0 ) -> 0x0

When this happens, the SSH service produces a log record that points to the root of the error:

Apr 10 09:18:29 openqaw5-xen sshd[1998]: Accepted keyboard-interactive/pam for root from 10.100.224.110 port 38502 ssh2
Apr 10 09:18:48 openqaw5-xen sshd[2038]: Accepted keyboard-interactive/pam for root from 10.100.224.110 port 38504 ssh2
Apr 10 09:22:30 openqaw5-xen sshd[2102]: error: kex_exchange_identification: Connection closed by remote host

Looking at the source code of openssh[0], we can see that the error shows up when the error code equals EPIPE. Turning to the definition of that specific code[1]:

Macro: int EPIPE
“Broken pipe.” There is no process reading from the other end of a pipe. Every library function that returns this error code also generates a SIGPIPE signal; this signal terminates the program if not handled or blocked. Thus, your program will never actually see EPIPE unless it has handled or blocked SIGPIPE.

So something between the server and the client is interrupted during openssh's kex_exchange_identification function, which exchanges the identification strings between them.
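
To find such dropped connections in a longer sshd log, the matching records can be filtered out with awk. This is a minimal sketch that assumes the syslog layout of the lines quoted in this comment; the function name `kex_failures` is hypothetical.

```shell
#!/bin/sh
# Print the timestamp and sshd PID of every connection that was dropped
# during the identification exchange.
kex_failures() {
    awk '/sshd\[[0-9]+\]: error: kex_exchange_identification/ {
        pid = $5
        gsub(/[^0-9]/, "", pid)   # reduce "sshd[2102]:" to "2102"
        print $1, $2, $3, pid
    }'
}

# Log lines quoted from this comment.
kex_failures <<'EOF'
Apr 10 09:18:29 openqaw5-xen sshd[1998]: Accepted keyboard-interactive/pam for root from 10.100.224.110 port 38502 ssh2
Apr 10 09:18:48 openqaw5-xen sshd[2038]: Accepted keyboard-interactive/pam for root from 10.100.224.110 port 38504 ssh2
Apr 10 09:22:30 openqaw5-xen sshd[2102]: error: kex_exchange_identification: Connection closed by remote host
EOF
```

Matching the timestamps against the openQA job log may help to tell which run_ssh_cmd call was in flight when the connection was closed.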

I tried to capture some traffic for analysis, but my skills are limited here. Ask me for the pcap if you can take a look (you might need to create a new one, however, and retrieve the keys for decrypting the data; I can't upload the file because it exceeds the size limit (>20M)).

Given the above finding, the only thing I can think of doing for now is to schedule the module earlier for jeos, which seems to work in the majority of the VRs, or to add a softfail(??).

[0] https://github.com/openssh/openssh-portable/blob/V_8_2/kex.c#L1247
[1] https://www.gnu.org/software/libc/manual/html_node/Error-Codes.html

Actions #14

Updated by cfconrad almost 4 years ago

I wasn't able to fix it in os-autoinst-distri-opensuse, but one of the following gives us a solution:

https://github.com/os-autoinst/os-autoinst/pull/1643 (just an improvement, to make the run_cmd() more configurable)
https://github.com/os-autoinst/os-autoinst/pull/1642 (preferred)

Actions #15

Updated by ybonatakis over 3 years ago

  • Status changed from In Progress to Feedback
Actions #16

Updated by nanzhang over 3 years ago

The latest run passed on build 176.1 (snapshot15 candidate) without the SSH issue; we may need one more build to verify the stability.

Actions #17

Updated by cfconrad over 3 years ago

With https://github.com/os-autoinst/os-autoinst/pull/1660, we are now also able to call console('svirt')->run_cmd($cmd, wantarray => 1, keep_open => 0) directly.

Actions #18

Updated by ybonatakis over 3 years ago

  • Status changed from Feedback to Resolved

merged

Actions #19

Updated by ph03nix 3 months ago

  • Tags changed from qac, jeos to MinimalVM
Actions #20

Updated by ph03nix 3 months ago

  • Project changed from 208 to Containers and images