action #109112: Improve os-autoinst sshXtermVt.pm connection error handling (was: "Test died: Error connecting to <root@redcurrant-4.qa.suse.de>: No route to host") size:M - openQA Infrastructure (public) - openSUSE Project Management Tool

openQA Project (public) - coordination #176337: [saga][epic] Stable os-autoinst backends with stable command execution (no mistyping)

openQA Project (public) - coordination #125708: [epic] Future ideas for more stable non-qemu backends

Improve os-autoinst sshXtermVt.pm connection error handling (was: "Test died: Error connecting to <root@redcurrant-4.qa.suse.de>: No route to host") size:M

Added by JERiveraMoya about 3 years ago. Updated about 2 years ago.

Status: Workable
Priority: Low
Assignee:
Category:
Target version: QA (public) - future
Start date: 2022-03-28
Due date:
% Done:
Estimated time:

Description

Test died: Error connecting to root@redcurrant-4.qa.suse.de: No route to host

Observation

Multiple scenarios fail on first boot when connecting to PowerVM machines (we also found at least one affected ipmi job).
The first test module that calls select_console('root-console'); fails.

In ppc64le PowerVM:
https://openqa.suse.de/tests/8418948#step/validate_lvm/1
https://openqa.suse.de/tests/8420902#step/system_prepare/1
https://openqa.suse.de/tests/8420907#step/validate_partition_table_via_blkid/1
https://openqa.suse.de/tests/8420908#step/validate_lvm/1
https://openqa.suse.de/tests/8420920#step/validate_partition_table_via_parted/1

From logs:

XIO:  fatal IO error 11 (Resource temporarily unavailable) on X server ":51899"
      after 28647 requests (28647 known processed) with 0 events remaining.
xterm: fatal IO error 11 (Resource temporarily unavailable) or KillClient on X server ":51899"
[2022-03-28T13:45:12.281350+02:00] [info] ::: backend::driver::__ANON__: Driver backend collected unknown process with pid 173481 and exit status: 1
[2022-03-28T13:45:12.282681+02:00] [info] ::: backend::driver::__ANON__: Driver backend collected unknown process with pid 174616 and exit status: 84
[2022-03-28T13:45:12.282797+02:00] [info] ::: backend::driver::__ANON__: Driver backend collected unknown process with pid 174619 and exit status: 0
[2022-03-28T13:45:12.461944+02:00] [debug] Connected to Xvnc - PID 177124
icewm PID is 177169
[2022-03-28T13:45:13.468637+02:00] [debug] Wait for SSH on host redcurrant-4.qa.suse.de (timeout: 120)
[2022-03-28T13:47:13.688450+02:00] [debug] redcurrant-4.qa.suse.de does not seems to have an active SSH server. Continuing anyway.
xterm PID is 178945
[2022-03-28T13:47:13.696027+02:00] [debug] <<< backend::baseclass::start_ssh_serial(username="root", password="SECRET", hostname="redcurrant-4.qa.suse.de")
[2022-03-28T13:47:13.696288+02:00] [debug] <<< backend::baseclass::new_ssh_connection(password="SECRET", hostname="redcurrant-4.qa.suse.de", username="root")
[2022-03-28T13:47:14.840534+02:00] [debug] Could not connect to root@redcurrant-4.qa.suse.de, Retrying after some seconds...
[2022-03-28T13:47:27.960550+02:00] [debug] Could not connect to root@redcurrant-4.qa.suse.de, Retrying after some seconds...
[2022-03-28T13:47:41.070671+02:00] [debug] Could not connect to root@redcurrant-4.qa.suse.de, Retrying after some seconds...
[2022-03-28T13:47:54.190507+02:00] [debug] Could not connect to root@redcurrant-4.qa.suse.de, Retrying after some seconds...
[2022-03-28T13:48:07.320520+02:00] [debug] Could not connect to root@redcurrant-4.qa.suse.de, Retrying after some seconds...
[2022-03-28T13:48:17.325260+02:00] [debug] post_fail_hook failed: Error connecting to <root@redcurrant-4.qa.suse.de>: No route to host at /usr/lib/os-autoinst/testapi.pm line 1759.
      testapi::select_console("root-ssh") called at sle/lib/Utils/Backends.pm line 83

In x86_64 ipmi: https://openqa.suse.de/tests/8420870#step/system_prepare/1

Acceptance criteria

AC1: Significantly higher code coverage in https://app.codecov.io/gh/os-autoinst/os-autoinst/blob/master/consoles/sshXtermVt.pm
AC2: The typo in the log message ("does not seems") is gone, e.g. because the whole message was removed :)

Suggestions

We accept the hypothesis that the jobs failed due to the lower-level network issues tracked in #108845, which have been fixed in the meantime, so there is nothing to do for the immediate root cause.
We can still improve a few things, though:
- Fix the typo in the message "does not seems"
- Do not continue after the SSH connection attempt fails
- Be explicit about the root cause. The test finally aborts with "No route to host", so we should have access to that message. For example, in https://github.com/os-autoinst/os-autoinst/blob/master/consoles/sshXtermVt.pm#L60 make sure that the error details (the underlying error message in $! or $@) are used for a better error message
- Make sure that we have unit test coverage, with some mocking, for this behaviour
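
The behaviour suggested above can be sketched as follows. This is a Python illustration only, not the Perl code in os-autoinst: the names wait_for_ssh, SSHUnavailableError and demo_with_mocked_network are hypothetical, and the injectable connect, sleep and clock parameters are an assumption made so that the network can be mocked in a unit test. The sketch stops instead of "continuing anyway" and puts the underlying OS error into the message.

```python
import errno
import socket
import time
from unittest import mock


class SSHUnavailableError(RuntimeError):
    """Raised when the SSH port never became reachable (hypothetical name)."""


def wait_for_ssh(hostname, port=22, timeout=120.0, interval=1.0,
                 connect=socket.create_connection, sleep=time.sleep,
                 clock=time.monotonic):
    """Return once a TCP connection to hostname:port succeeds.

    Unlike the logged behaviour, this does not continue after the timeout:
    it raises and includes the last underlying OS error in the message.
    """
    deadline = clock() + timeout
    last_error = None
    while True:
        try:
            # The per-attempt timeout keeps one attempt from using up the budget.
            with connect((hostname, port), timeout=interval):
                return
        except OSError as err:
            last_error = err  # remember the root cause, e.g. "No route to host"
        if clock() >= deadline:
            break
        sleep(interval)
    raise SSHUnavailableError(
        f"{hostname} does not seem to have an active SSH server "
        f"after {timeout:.0f}s: {last_error}"
    ) from last_error


def demo_with_mocked_network():
    """Exercise the failure path without touching the real network."""
    # Every connection attempt fails with "No route to host".
    failing_connect = mock.Mock(
        side_effect=OSError(errno.EHOSTUNREACH, "No route to host"))
    fake_time = iter(range(100))  # deterministic clock: 0, 1, 2, ...
    try:
        wait_for_ssh("host.example", timeout=3, interval=1,
                     connect=failing_connect, sleep=lambda _seconds: None,
                     clock=lambda: next(fake_time))
    except SSHUnavailableError as err:
        return str(err)
    return None
```

Calling demo_with_mocked_network() returns the error text, which names the host and ends with the "No route to host" detail from the mocked OSError. A real unit test for sshXtermVt.pm would need equivalent mocking on the Perl side.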

Related issues 3 (2 open — 1 closed)

Updated by okurz about 3 years ago

Priority changed from Normal to High
Target version set to Ready

Updated by JERiveraMoya about 3 years ago

Related to action #98832: [qac][container][powerVM] rebootmgr fails in PowerVM reconnecting after reboot added

Updated by mkittler about 3 years ago

I've just tried to connect to redcurrant-4.qa.suse.de via SSH and it works, apart from the fact that I don't know the password. I'm not sure how we can help here; it looks like just a temporary networking issue, maybe even related to #108845.

Updated by okurz about 3 years ago

Subject changed from Test died: Error connecting to <root@redcurrant-4.qa.suse.de>: No route to host to Test died: Error connecting to <root@redcurrant-4.qa.suse.de>: No route to host size:M
Description updated (diff)
Status changed from New to Workable

Updated by okurz about 3 years ago

Subject changed from Test died: Error connecting to <root@redcurrant-4.qa.suse.de>: No route to host size:M to Improve os-autoinst sshXtermVt.pm connection error handling (was: "Test died: Error connecting to <root@redcurrant-4.qa.suse.de>: No route to host") size:M
Priority changed from High to Low

Updated by livdywan about 3 years ago

Has duplicate action #109719: [qe-core][ppc][hmc] Network issues on ppc64le-hmc workers : "Error connecting to redcurrant-3.qa.suse.de: no route to host" added

Updated by okurz about 3 years ago

Parent task set to #109656

Updated by openqa_review about 3 years ago

This is an autogenerated message for openQA integration by the openqa_review script:

This bug is still referenced in a failing openQA test: RAID0@ppc64le-hmc-4disk
https://openqa.suse.de/tests/8570565#step/validate_md_raid/1

To prevent further reminder comments one of the following options should be followed:

- The test scenario is fixed by applying the bug fix to the tested product or the test is adjusted
- The openQA job group is moved to "Released" or "EOL" (End-of-Life)
- The bugref in the openQA scenario is removed or replaced, e.g. label:wontfix:boo1234

Expect the next reminder at the earliest in 28 days if nothing changes in this ticket.

Updated by JERiveraMoya about 3 years ago

This keeps happening sporadically and seems unrelated to build, network, or power issues.

#10

Updated by JERiveraMoya about 3 years ago

In the transactional scenario we have been seeing the same issue for a long time, and there it is not sporadic:
#98832

#11

Updated by openqa_review about 3 years ago

This is an autogenerated message for openQA integration by the openqa_review script:

This bug is still referenced in a failing openQA test: transactional_server_helper_apps@ppc64le-hmc-single-disk
https://openqa.suse.de/tests/8743569#step/rebootmgr/1

To prevent further reminder comments one of the following options should be followed:

- The test scenario is fixed by applying the bug fix to the tested product or the test is adjusted
- The openQA job group is moved to "Released" or "EOL" (End-of-Life)
- The bugref in the openQA scenario is removed or replaced, e.g. label:wontfix:boo1234

Expect the next reminder at the earliest in 28 days if nothing changes in this ticket.

#12

Updated by openqa_review almost 3 years ago

This is an autogenerated message for openQA integration by the openqa_review script:

This bug is still referenced in a failing openQA test: transactional_server_helper_apps@ppc64le-hmc-single-disk
https://openqa.suse.de/tests/8752118#step/rebootmgr/1

To prevent further reminder comments one of the following options should be followed:

- The test scenario is fixed by applying the bug fix to the tested product or the test is adjusted
- The openQA job group is moved to "Released" or "EOL" (End-of-Life)
- The bugref in the openQA scenario is removed or replaced, e.g. label:wontfix:boo1234

Expect the next reminder at the earliest in 56 days if nothing changes in this ticket.

#13

Updated by okurz almost 3 years ago

Target version changed from Ready to future

We in the SUSE QE Tools team currently do not have the capacity to work on this, so we are removing it from the backlog.

#14

Updated by JERiveraMoya over 2 years ago

Related to action #117127: Run rebootmgr in transactional_server_helper_apps in PowerVM only in YaST development group added

#15

Updated by okurz about 2 years ago

Parent task changed from #109656 to #125708
