action #109112

open

openQA Project - coordination #125708: [epic] Future ideas for more stable non-qemu backends

Improve os-autoinst sshXtermVt.pm connection error handling (was: "Test died: Error connecting to <root@redcurrant-4.qa.suse.de>: No route to host") size:M

Added by JERiveraMoya almost 2 years ago. Updated about 1 year ago.

Status: Workable
Priority: Low
Assignee: -
Category: -
Target version:
Start date: 2022-03-28
Due date:
% Done: 0%
Estimated time:

Description

Test died: Error connecting to root@redcurrant-4.qa.suse.de: No route to host

Observation

We see failures in multiple scenarios on first boot when connecting to PowerVM machines (and we also found an affected ipmi job).
The first test module that calls select_console('root-console'); fails.

In ppc64le PowerVM:
https://openqa.suse.de/tests/8418948#step/validate_lvm/1
https://openqa.suse.de/tests/8420902#step/system_prepare/1
https://openqa.suse.de/tests/8420907#step/validate_partition_table_via_blkid/1
https://openqa.suse.de/tests/8420908#step/validate_lvm/1
https://openqa.suse.de/tests/8420920#step/validate_partition_table_via_parted/1

From logs:

XIO:  fatal IO error 11 (Resource temporarily unavailable) on X server ":51899"
      after 28647 requests (28647 known processed) with 0 events remaining.
xterm: fatal IO error 11 (Resource temporarily unavailable) or KillClient on X server ":51899"
[2022-03-28T13:45:12.281350+02:00] [info] ::: backend::driver::__ANON__: Driver backend collected unknown process with pid 173481 and exit status: 1
[2022-03-28T13:45:12.282681+02:00] [info] ::: backend::driver::__ANON__: Driver backend collected unknown process with pid 174616 and exit status: 84
[2022-03-28T13:45:12.282797+02:00] [info] ::: backend::driver::__ANON__: Driver backend collected unknown process with pid 174619 and exit status: 0
[2022-03-28T13:45:12.461944+02:00] [debug] Connected to Xvnc - PID 177124
icewm PID is 177169
[2022-03-28T13:45:13.468637+02:00] [debug] Wait for SSH on host redcurrant-4.qa.suse.de (timeout: 120)
[2022-03-28T13:47:13.688450+02:00] [debug] redcurrant-4.qa.suse.de does not seems to have an active SSH server. Continuing anyway.
xterm PID is 178945
[2022-03-28T13:47:13.696027+02:00] [debug] <<< backend::baseclass::start_ssh_serial(username="root", password="SECRET", hostname="redcurrant-4.qa.suse.de")
[2022-03-28T13:47:13.696288+02:00] [debug] <<< backend::baseclass::new_ssh_connection(password="SECRET", hostname="redcurrant-4.qa.suse.de", username="root")
[2022-03-28T13:47:14.840534+02:00] [debug] Could not connect to root@redcurrant-4.qa.suse.de, Retrying after some seconds...
[2022-03-28T13:47:27.960550+02:00] [debug] Could not connect to root@redcurrant-4.qa.suse.de, Retrying after some seconds...
[2022-03-28T13:47:41.070671+02:00] [debug] Could not connect to root@redcurrant-4.qa.suse.de, Retrying after some seconds...
[2022-03-28T13:47:54.190507+02:00] [debug] Could not connect to root@redcurrant-4.qa.suse.de, Retrying after some seconds...
[2022-03-28T13:48:07.320520+02:00] [debug] Could not connect to root@redcurrant-4.qa.suse.de, Retrying after some seconds...
[2022-03-28T13:48:17.325260+02:00] [debug] post_fail_hook failed: Error connecting to <root@redcurrant-4.qa.suse.de>: No route to host at /usr/lib/os-autoinst/testapi.pm line 1759.
      testapi::select_console("root-ssh") called at sle/lib/Utils/Backends.pm line 83

In x86_64 ipmi: https://openqa.suse.de/tests/8420870#step/system_prepare/1

Acceptance criteria

Suggestions

  • We accept the hypothesis that the jobs failed due to the lower-level network issues tracked in #108845, which have since been fixed, so there is nothing to do for the immediate root cause
  • We can still improve the error handling:
    • Fix the typo "does not seems" in the log message
    • Do not continue after the SSH connection attempt fails
    • Be explicit about the root cause. The test eventually aborts with "No route to host", so we should have access to that message. For example, in https://github.com/os-autoinst/os-autoinst/blob/master/consoles/sshXtermVt.pm#L60 make sure that the error details (the underlying error message in $! or $@) are used for a better error message
    • Make sure that we have unit test coverage with some mocking for this behaviour
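
The last three suggestions could look roughly like the following Python sketch. It is not os-autoinst code (which is Perl); the names connect_with_retry and SSHConnectError, the connect callback and the host name host.example are hypothetical and only illustrate the intended control flow: retry a bounded number of times, abort instead of continuing when all attempts fail, and carry the underlying error text (the analogue of $! or $@) into the final message.

```python
import time


class SSHConnectError(RuntimeError):
    """Raised when no SSH connection could be established (hypothetical name)."""


def connect_with_retry(connect, host, user="root", attempts=5, delay=0.0,
                       sleep=time.sleep):
    """Call connect(host, user) up to `attempts` times.

    `connect` and `sleep` are injected so that a unit test can replace them
    with mocks. If every attempt fails, raise SSHConnectError that includes
    the last underlying error message instead of continuing silently.
    """
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    last_error = None
    for attempt in range(1, attempts + 1):
        try:
            return connect(host, user)
        except OSError as err:
            last_error = err
            # Only wait when another attempt will follow.
            if attempt < attempts:
                sleep(delay)
    # Prefer the OS-level message (e.g. "No route to host") when available.
    reason = last_error.strerror or str(last_error)
    raise SSHConnectError(
        f"Error connecting to <{user}@{host}> after {attempts} attempts: {reason}"
    ) from last_error
```

Because the connection function and the sleep function are passed in, a test can substitute a callable that always raises OSError and check that the final error message contains the original reason, without touching the network or waiting for real timeouts.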

Related issues 3 (2 open, 1 closed)

Related to openQA Tests - action #98832: [qac][container][powerVM] rebootmgr fails in PowerVM reconnecting after reboot (New, 2021-09-17)
Related to qe-yam - action #117127: Run rebootmgr in transactional_server_helper_apps in PowerVM only in YaST development group (Resolved, leli, 2022-09-23)
Has duplicate openQA Tests - action #109719: [qe-core][ppc][hmc] Network issues on ppc64le-hmc workers : "Error connecting to redcurrant-3.qa.suse.de: no route to host" (New, 2022-04-08)

Actions #1

Updated by okurz almost 2 years ago

  • Priority changed from Normal to High
  • Target version set to Ready
Actions #2

Updated by JERiveraMoya almost 2 years ago

  • Related to action #98832: [qac][container][powerVM] rebootmgr fails in PowerVM reconnecting after reboot added
Actions #3

Updated by mkittler almost 2 years ago

I've just tried to connect to redcurrant-4.qa.suse.de via SSH and it works, apart from the fact that I don't know the password. I'm not sure how we can help here. It looks like a temporary networking issue, maybe even related to #108845.

Actions #4

Updated by okurz almost 2 years ago

  • Subject changed from Test died: Error connecting to <root@redcurrant-4.qa.suse.de>: No route to host to Test died: Error connecting to <root@redcurrant-4.qa.suse.de>: No route to host size:M
  • Description updated (diff)
  • Status changed from New to Workable
Actions #5

Updated by okurz almost 2 years ago

  • Subject changed from Test died: Error connecting to <root@redcurrant-4.qa.suse.de>: No route to host size:M to Improve os-autoinst sshXtermVt.pm connection error handling (was: "Test died: Error connecting to <root@redcurrant-4.qa.suse.de>: No route to host") size:M
  • Priority changed from High to Low
Actions #6

Updated by livdywan almost 2 years ago

  • Has duplicate action #109719: [qe-core][ppc][hmc] Network issues on ppc64le-hmc workers : "Error connecting to redcurrant-3.qa.suse.de: no route to host" added
Actions #7

Updated by okurz almost 2 years ago

  • Parent task set to #109656
Actions #8

Updated by openqa_review almost 2 years ago

This is an autogenerated message for openQA integration by the openqa_review script:

This bug is still referenced in a failing openQA test: RAID0@ppc64le-hmc-4disk
https://openqa.suse.de/tests/8570565#step/validate_md_raid/1

To prevent further reminder comments one of the following options should be followed:

  1. The test scenario is fixed by applying the bug fix to the tested product or the test is adjusted
  2. The openQA job group is moved to "Released" or "EOL" (End-of-Life)
  3. The bugref in the openQA scenario is removed or replaced, e.g. label:wontfix:boo1234

Expect the next reminder at the earliest in 28 days if nothing changes in this ticket.

Actions #9

Updated by JERiveraMoya almost 2 years ago

This keeps happening sporadically and seems unrelated to builds, network, or power issues.

Actions #10

Updated by JERiveraMoya almost 2 years ago

In the transactional scenario we have been seeing the same issue for a long time, and not just sporadically:
#98832

Actions #11

Updated by openqa_review almost 2 years ago

This is an autogenerated message for openQA integration by the openqa_review script:

This bug is still referenced in a failing openQA test: transactional_server_helper_apps@ppc64le-hmc-single-disk
https://openqa.suse.de/tests/8743569#step/rebootmgr/1

To prevent further reminder comments one of the following options should be followed:

  1. The test scenario is fixed by applying the bug fix to the tested product or the test is adjusted
  2. The openQA job group is moved to "Released" or "EOL" (End-of-Life)
  3. The bugref in the openQA scenario is removed or replaced, e.g. label:wontfix:boo1234

Expect the next reminder at the earliest in 28 days if nothing changes in this ticket.

Actions #12

Updated by openqa_review almost 2 years ago

This is an autogenerated message for openQA integration by the openqa_review script:

This bug is still referenced in a failing openQA test: transactional_server_helper_apps@ppc64le-hmc-single-disk
https://openqa.suse.de/tests/8752118#step/rebootmgr/1

To prevent further reminder comments one of the following options should be followed:

  1. The test scenario is fixed by applying the bug fix to the tested product or the test is adjusted
  2. The openQA job group is moved to "Released" or "EOL" (End-of-Life)
  3. The bugref in the openQA scenario is removed or replaced, e.g. label:wontfix:boo1234

Expect the next reminder at the earliest in 56 days if nothing changes in this ticket.

Actions #13

Updated by okurz over 1 year ago

  • Target version changed from Ready to future

We in the SUSE QE Tools team currently do not have the capacity to work on this, so we are removing it from the backlog.

Actions #14

Updated by JERiveraMoya over 1 year ago

  • Related to action #117127: Run rebootmgr in transactional_server_helper_apps in PowerVM only in YaST development group added
Actions #15

Updated by okurz about 1 year ago

  • Parent task changed from #109656 to #125708