action #109112

openQA Project - coordination #109668: [saga][epic] Stable and updated non-qemu backends for SLE validation

openQA Project - coordination #109656: [epic] Stable non-qemu backends

Improve os-autoinst sshXtermVt.pm connection error handling (was: "Test died: Error connecting to <root@redcurrant-4.qa.suse.de>: No route to host") size:M

Added by JERiveraMoya 3 months ago. Updated 16 days ago.

Status: Workable
Priority: Low
Assignee: -
Target version:
Start date: 2022-03-28
Due date:
% Done: 0%
Estimated time:
Description

Test died: Error connecting to root@redcurrant-4.qa.suse.de: No route to host

Observation

We see failures in multiple scenarios on first boot when connecting to PowerVM machines (we also found an affected ipmi job).
The first test module that calls select_console('root-console'); fails.

In ppc64le PowerVM:
https://openqa.suse.de/tests/8418948#step/validate_lvm/1
https://openqa.suse.de/tests/8420902#step/system_prepare/1
https://openqa.suse.de/tests/8420907#step/validate_partition_table_via_blkid/1
https://openqa.suse.de/tests/8420908#step/validate_lvm/1
https://openqa.suse.de/tests/8420920#step/validate_partition_table_via_parted/1

From logs:

XIO:  fatal IO error 11 (Resource temporarily unavailable) on X server ":51899"
      after 28647 requests (28647 known processed) with 0 events remaining.
xterm: fatal IO error 11 (Resource temporarily unavailable) or KillClient on X server ":51899"
[2022-03-28T13:45:12.281350+02:00] [info] ::: backend::driver::__ANON__: Driver backend collected unknown process with pid 173481 and exit status: 1
[2022-03-28T13:45:12.282681+02:00] [info] ::: backend::driver::__ANON__: Driver backend collected unknown process with pid 174616 and exit status: 84
[2022-03-28T13:45:12.282797+02:00] [info] ::: backend::driver::__ANON__: Driver backend collected unknown process with pid 174619 and exit status: 0
[2022-03-28T13:45:12.461944+02:00] [debug] Connected to Xvnc - PID 177124
icewm PID is 177169
[2022-03-28T13:45:13.468637+02:00] [debug] Wait for SSH on host redcurrant-4.qa.suse.de (timeout: 120)
[2022-03-28T13:47:13.688450+02:00] [debug] redcurrant-4.qa.suse.de does not seems to have an active SSH server. Continuing anyway.
xterm PID is 178945
[2022-03-28T13:47:13.696027+02:00] [debug] <<< backend::baseclass::start_ssh_serial(username="root", password="SECRET", hostname="redcurrant-4.qa.suse.de")
[2022-03-28T13:47:13.696288+02:00] [debug] <<< backend::baseclass::new_ssh_connection(password="SECRET", hostname="redcurrant-4.qa.suse.de", username="root")
[2022-03-28T13:47:14.840534+02:00] [debug] Could not connect to root@redcurrant-4.qa.suse.de, Retrying after some seconds...
[2022-03-28T13:47:27.960550+02:00] [debug] Could not connect to root@redcurrant-4.qa.suse.de, Retrying after some seconds...
[2022-03-28T13:47:41.070671+02:00] [debug] Could not connect to root@redcurrant-4.qa.suse.de, Retrying after some seconds...
[2022-03-28T13:47:54.190507+02:00] [debug] Could not connect to root@redcurrant-4.qa.suse.de, Retrying after some seconds...
[2022-03-28T13:48:07.320520+02:00] [debug] Could not connect to root@redcurrant-4.qa.suse.de, Retrying after some seconds...
[2022-03-28T13:48:17.325260+02:00] [debug] post_fail_hook failed: Error connecting to <root@redcurrant-4.qa.suse.de>: No route to host at /usr/lib/os-autoinst/testapi.pm line 1759.
      testapi::select_console("root-ssh") called at sle/lib/Utils/Backends.pm line 83

In x86_64 ipmi: https://openqa.suse.de/tests/8420870#step/system_prepare/1

Acceptance criteria

Suggestions

  • We accept the hypothesis that the jobs failed due to the lower-level network issues tracked in #108845, which has since received a fix, so there is nothing to do for the immediate root cause.
  • We can still improve the error handling:
    • Fix the typo in the message "does not seems" (it should read "does not seem").
    • Do not continue after the SSH connection check fails.
    • Be explicit about the root cause. The test finally aborts with "No route to host", so we should have access to that message. For example, in https://github.com/os-autoinst/os-autoinst/blob/master/consoles/sshXtermVt.pm#L60 make sure that the error details (the underlying error message in $! or $@) are used for a better error message.
    • Make sure that we have unit test coverage with some mocking for this behaviour.
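
To illustrate the suggestions above, here is a minimal sketch of the intended behaviour: fail as soon as the SSH wait times out, and include the underlying error in the message. It is written in Python purely for illustration and is not the os-autoinst Perl code. The names wait_for_ssh, SSHUnreachableError and host.example are hypothetical, and the connect, sleep and clock parameters exist only so that a unit test can mock them.

```python
import socket
import time
from unittest import mock


class SSHUnreachableError(RuntimeError):
    """Hypothetical error type: the SSH port never became reachable."""


def wait_for_ssh(hostname, port=22, timeout=120, interval=5,
                 connect=socket.create_connection, sleep=time.sleep,
                 clock=time.monotonic):
    """Return once a TCP connection to hostname:port succeeds.

    Instead of logging a message and continuing anyway, raise an error
    that carries the last OS-level failure (e.g. "No route to host").
    """
    deadline = clock() + timeout
    last_error = None
    while True:
        try:
            conn = connect((hostname, port), timeout=interval)
            conn.close()
            return
        except OSError as err:
            # Remember the root cause so it can be reported later.
            last_error = err
        if clock() >= deadline:
            break
        sleep(interval)
    raise SSHUnreachableError(
        f"{hostname} does not seem to have an active SSH server "
        f"after {timeout}s: {last_error}"
    ) from last_error


def test_reports_root_cause():
    """Unit test sketch: mock the network call and the clock."""
    failing_connect = mock.Mock(side_effect=OSError(113, "No route to host"))
    fake_time = iter(range(0, 100, 5))  # each clock() call advances 5 s
    try:
        wait_for_ssh("host.example", timeout=10, connect=failing_connect,
                     sleep=lambda _seconds: None,
                     clock=lambda: next(fake_time))
    except SSHUnreachableError as err:
        assert "No route to host" in str(err)
        assert failing_connect.call_count == 2
    else:
        raise AssertionError("expected SSHUnreachableError")
```

Injecting the connection function and the clock keeps the test fast and independent of any real host. An equivalent Perl test would presumably mock the corresponding calls, but that is an assumption and not something this ticket specifies.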

Related issues

Related to openQA Tests - action #98832: [qac][container][powerVM] rebootmgr fails in PowerVM reconnecting after reboot (New, 2021-09-17)

Has duplicate openQA Tests - action #109719: [qe-core][ppc][hmc] Network issues on ppc64le-hmc workers : "Error connecting to redcurrant-3.qa.suse.de: no route to host" (New, 2022-04-08)

History

#1 Updated by okurz 3 months ago

  • Priority changed from Normal to High
  • Target version set to Ready

#2 Updated by JERiveraMoya 3 months ago

  • Related to action #98832: [qac][container][powerVM] rebootmgr fails in PowerVM reconnecting after reboot added

#3 Updated by mkittler 3 months ago

I've just tried to connect to redcurrant-4.qa.suse.de via SSH and it is reachable, although I could not log in because I don't know the password. Not sure how we can help here; it looks like a temporary networking issue, maybe even related to #108845.

#4 Updated by okurz 3 months ago

  • Subject changed from Test died: Error connecting to <root@redcurrant-4.qa.suse.de>: No route to host to Test died: Error connecting to <root@redcurrant-4.qa.suse.de>: No route to host size:M
  • Description updated (diff)
  • Status changed from New to Workable

#5 Updated by okurz 3 months ago

  • Subject changed from Test died: Error connecting to <root@redcurrant-4.qa.suse.de>: No route to host size:M to Improve os-autoinst sshXtermVt.pm connection error handling (was: "Test died: Error connecting to <root@redcurrant-4.qa.suse.de>: No route to host") size:M
  • Priority changed from High to Low

#6 Updated by cdywan 3 months ago

  • Has duplicate action #109719: [qe-core][ppc][hmc] Network issues on ppc64le-hmc workers : "Error connecting to redcurrant-3.qa.suse.de: no route to host" added

#7 Updated by okurz 3 months ago

  • Parent task set to #109656

#8 Updated by openqa_review 2 months ago

This is an autogenerated message for openQA integration by the openqa_review script:

This bug is still referenced in a failing openQA test: RAID0@ppc64le-hmc-4disk
https://openqa.suse.de/tests/8570565#step/validate_md_raid/1

To prevent further reminder comments one of the following options should be followed:

  1. The test scenario is fixed by applying the bug fix to the tested product or the test is adjusted
  2. The openQA job group is moved to "Released" or "EOL" (End-of-Life)
  3. The bugref in the openQA scenario is removed or replaced, e.g. label:wontfix:boo1234

Expect the next reminder at the earliest in 28 days if nothing changes in this ticket.

#9 Updated by JERiveraMoya about 2 months ago

This keeps happening sporadically and seems unrelated to build, network, or power issues.

#10 Updated by JERiveraMoya about 2 months ago

In the transactional scenario we have been seeing the same issue for a long time, and not just sporadically:
#98832

#11 Updated by openqa_review about 1 month ago

This is an autogenerated message for openQA integration by the openqa_review script:

This bug is still referenced in a failing openQA test: transactional_server_helper_apps@ppc64le-hmc-single-disk
https://openqa.suse.de/tests/8743569#step/rebootmgr/1

To prevent further reminder comments one of the following options should be followed:

  1. The test scenario is fixed by applying the bug fix to the tested product or the test is adjusted
  2. The openQA job group is moved to "Released" or "EOL" (End-of-Life)
  3. The bugref in the openQA scenario is removed or replaced, e.g. label:wontfix:boo1234

Expect the next reminder at the earliest in 28 days if nothing changes in this ticket.

#12 Updated by openqa_review 16 days ago

This is an autogenerated message for openQA integration by the openqa_review script:

This bug is still referenced in a failing openQA test: transactional_server_helper_apps@ppc64le-hmc-single-disk
https://openqa.suse.de/tests/8752118#step/rebootmgr/1

To prevent further reminder comments one of the following options should be followed:

  1. The test scenario is fixed by applying the bug fix to the tested product or the test is adjusted
  2. The openQA job group is moved to "Released" or "EOL" (End-of-Life)
  3. The bugref in the openQA scenario is removed or replaced, e.g. label:wontfix:boo1234

Expect the next reminder at the earliest in 56 days if nothing changes in this ticket.
