action #109112

open

openQA Project - coordination #125708: [epic] Future ideas for more stable non-qemu backends

Improve os-autoinst sshXtermVt.pm connection error handling (was: "Test died: Error connecting to <root@redcurrant-4.qa.suse.de>: No route to host") size:M

Added by JERiveraMoya about 2 years ago. Updated about 1 year ago.

Status: Workable
Priority: Low
Assignee: -
Category: -
Target version:
Start date: 2022-03-28
Due date:
% Done: 0%
Estimated time:

Description

Test died: Error connecting to root@redcurrant-4.qa.suse.de: No route to host

Observation

Multiple scenarios fail on first boot when connecting to PowerVM machines (we also found an affected ipmi job).
The first test that tries to run select_console('root-console'); fails.

In ppc64le PowerVM:
https://openqa.suse.de/tests/8418948#step/validate_lvm/1
https://openqa.suse.de/tests/8420902#step/system_prepare/1
https://openqa.suse.de/tests/8420907#step/validate_partition_table_via_blkid/1
https://openqa.suse.de/tests/8420908#step/validate_lvm/1
https://openqa.suse.de/tests/8420920#step/validate_partition_table_via_parted/1

From logs:

XIO:  fatal IO error 11 (Resource temporarily unavailable) on X server ":51899"
      after 28647 requests (28647 known processed) with 0 events remaining.
xterm: fatal IO error 11 (Resource temporarily unavailable) or KillClient on X server ":51899"
[2022-03-28T13:45:12.281350+02:00] [info] ::: backend::driver::__ANON__: Driver backend collected unknown process with pid 173481 and exit status: 1
[2022-03-28T13:45:12.282681+02:00] [info] ::: backend::driver::__ANON__: Driver backend collected unknown process with pid 174616 and exit status: 84
[2022-03-28T13:45:12.282797+02:00] [info] ::: backend::driver::__ANON__: Driver backend collected unknown process with pid 174619 and exit status: 0
[2022-03-28T13:45:12.461944+02:00] [debug] Connected to Xvnc - PID 177124
icewm PID is 177169
[2022-03-28T13:45:13.468637+02:00] [debug] Wait for SSH on host redcurrant-4.qa.suse.de (timeout: 120)
[2022-03-28T13:47:13.688450+02:00] [debug] redcurrant-4.qa.suse.de does not seems to have an active SSH server. Continuing anyway.
xterm PID is 178945
[2022-03-28T13:47:13.696027+02:00] [debug] <<< backend::baseclass::start_ssh_serial(username="root", password="SECRET", hostname="redcurrant-4.qa.suse.de")
[2022-03-28T13:47:13.696288+02:00] [debug] <<< backend::baseclass::new_ssh_connection(password="SECRET", hostname="redcurrant-4.qa.suse.de", username="root")
[2022-03-28T13:47:14.840534+02:00] [debug] Could not connect to root@redcurrant-4.qa.suse.de, Retrying after some seconds...
[2022-03-28T13:47:27.960550+02:00] [debug] Could not connect to root@redcurrant-4.qa.suse.de, Retrying after some seconds...
[2022-03-28T13:47:41.070671+02:00] [debug] Could not connect to root@redcurrant-4.qa.suse.de, Retrying after some seconds...
[2022-03-28T13:47:54.190507+02:00] [debug] Could not connect to root@redcurrant-4.qa.suse.de, Retrying after some seconds...
[2022-03-28T13:48:07.320520+02:00] [debug] Could not connect to root@redcurrant-4.qa.suse.de, Retrying after some seconds...
[2022-03-28T13:48:17.325260+02:00] [debug] post_fail_hook failed: Error connecting to <root@redcurrant-4.qa.suse.de>: No route to host at /usr/lib/os-autoinst/testapi.pm line 1759.
      testapi::select_console("root-ssh") called at sle/lib/Utils/Backends.pm line 83

In x86_64 ipmi: https://openqa.suse.de/tests/8420870#step/system_prepare/1

Acceptance criteria

Suggestions

  • We accept the hypothesis that the jobs failed only because of the lower-level network issues tracked in #108845, which has meanwhile received a fix, so there is nothing to do for the immediate root cause.
  • We can still improve the error handling, though:
    • Fix the typo in the message "does not seems".
    • Do not continue after the SSH connection fails.
    • Be explicit about the root cause. The test finally aborts with "No route to host", so we should have access to that message. For example, in https://github.com/os-autoinst/os-autoinst/blob/master/consoles/sshXtermVt.pm#L60 make sure that the error details (the underlying error message in $! or $@) are used for a better error message.
    • Make sure that we have unit test coverage, with some mocking, for this behaviour.
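
The behaviour suggested above could look roughly like the following sketch. It is written in Python purely for illustration and is not the os-autoinst Perl code: the names wait_for_ssh and SSHConnectError, the retry count, and the delay are all hypothetical. It only shows the three ideas from the list: abort instead of continuing, carry the underlying error text into the final message, and make the connection step injectable so a unit test can mock it.

```python
import socket
import time


class SSHConnectError(Exception):
    """Hypothetical error type: the host stayed unreachable after all retries."""


def wait_for_ssh(hostname, port=22, retries=5, delay=10.0,
                 connect=socket.create_connection, sleep=time.sleep):
    """Try to open a TCP connection to the SSH port; fail loudly if it never works.

    `connect` and `sleep` are parameters so that a unit test can replace them
    with mocks instead of touching the network or really waiting.
    """
    last_error = None
    for attempt in range(1, retries + 1):
        try:
            conn = connect((hostname, port), timeout=delay)
            conn.close()
            return attempt  # number of attempts that were needed
        except OSError as err:
            last_error = err  # keep the underlying cause, e.g. "No route to host"
            if attempt < retries:
                sleep(delay)
    # Do not continue: abort here and include the underlying error in the message.
    raise SSHConnectError(
        f"{hostname} does not seem to have an active SSH server "
        f"after {retries} attempts: {last_error}"
    )


if __name__ == "__main__":
    # Mocked connection that always fails the way the jobs in this ticket did.
    def unreachable(address, timeout=None):
        raise OSError(113, "No route to host")

    try:
        wait_for_ssh("host.example", retries=2, connect=unreachable,
                     sleep=lambda seconds: None)
    except SSHConnectError as err:
        print(err)
```

With this shape the test fails at the point where the connection is given up, and the job log states the root cause directly rather than only in a later post_fail_hook.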

Related issues 3 (2 open, 1 closed)

Related to openQA Tests - action #98832: [qac][container][powerVM] rebootmgr fails in PowerVM reconnecting after reboot (New, 2021-09-17)
Related to qe-yam - action #117127: Run rebootmgr in transactional_server_helper_apps in PowerVM only in YaST development group (Resolved, leli, 2022-09-23)
Has duplicate openQA Tests - action #109719: [qe-core][ppc][hmc] Network issues on ppc64le-hmc workers : "Error connecting to redcurrant-3.qa.suse.de: no route to host" (New, 2022-04-08)
