action #109112
openQA Project - coordination #125708: [epic] Future ideas for more stable non-qemu backends
Improve os-autoinst sshXtermVt.pm connection error handling (was: "Test died: Error connecting to <root@redcurrant-4.qa.suse.de>: No route to host") size:M
Description
Test died: Error connecting to root@redcurrant-4.qa.suse.de: No route to host
Observation
We see failures in multiple scenarios on first boot when connecting to PowerVM machines (we also found an affected ipmi job).
The first test module that tries to run select_console('root-console'); fails.
In ppc64le PowerVM:
https://openqa.suse.de/tests/8418948#step/validate_lvm/1
https://openqa.suse.de/tests/8420902#step/system_prepare/1
https://openqa.suse.de/tests/8420907#step/validate_partition_table_via_blkid/1
https://openqa.suse.de/tests/8420908#step/validate_lvm/1
https://openqa.suse.de/tests/8420920#step/validate_partition_table_via_parted/1
From logs:
XIO: fatal IO error 11 (Resource temporarily unavailable) on X server ":51899"
after 28647 requests (28647 known processed) with 0 events remaining.
xterm: fatal IO error 11 (Resource temporarily unavailable) or KillClient on X server ":51899"
[2022-03-28T13:45:12.281350+02:00] [info] ::: backend::driver::__ANON__: Driver backend collected unknown process with pid 173481 and exit status: 1
[2022-03-28T13:45:12.282681+02:00] [info] ::: backend::driver::__ANON__: Driver backend collected unknown process with pid 174616 and exit status: 84
[2022-03-28T13:45:12.282797+02:00] [info] ::: backend::driver::__ANON__: Driver backend collected unknown process with pid 174619 and exit status: 0
[2022-03-28T13:45:12.461944+02:00] [debug] Connected to Xvnc - PID 177124
icewm PID is 177169
[2022-03-28T13:45:13.468637+02:00] [debug] Wait for SSH on host redcurrant-4.qa.suse.de (timeout: 120)
[2022-03-28T13:47:13.688450+02:00] [debug] redcurrant-4.qa.suse.de does not seems to have an active SSH server. Continuing anyway.
xterm PID is 178945
[2022-03-28T13:47:13.696027+02:00] [debug] <<< backend::baseclass::start_ssh_serial(username="root", password="SECRET", hostname="redcurrant-4.qa.suse.de")
[2022-03-28T13:47:13.696288+02:00] [debug] <<< backend::baseclass::new_ssh_connection(password="SECRET", hostname="redcurrant-4.qa.suse.de", username="root")
[2022-03-28T13:47:14.840534+02:00] [debug] Could not connect to root@redcurrant-4.qa.suse.de, Retrying after some seconds...
[2022-03-28T13:47:27.960550+02:00] [debug] Could not connect to root@redcurrant-4.qa.suse.de, Retrying after some seconds...
[2022-03-28T13:47:41.070671+02:00] [debug] Could not connect to root@redcurrant-4.qa.suse.de, Retrying after some seconds...
[2022-03-28T13:47:54.190507+02:00] [debug] Could not connect to root@redcurrant-4.qa.suse.de, Retrying after some seconds...
[2022-03-28T13:48:07.320520+02:00] [debug] Could not connect to root@redcurrant-4.qa.suse.de, Retrying after some seconds...
[2022-03-28T13:48:17.325260+02:00] [debug] post_fail_hook failed: Error connecting to <root@redcurrant-4.qa.suse.de>: No route to host at /usr/lib/os-autoinst/testapi.pm line 1759.
testapi::select_console("root-ssh") called at sle/lib/Utils/Backends.pm line 83
In x86_64 ipmi: https://openqa.suse.de/tests/8420870#step/system_prepare/1
Acceptance criteria
- AC1: Significantly higher code coverage in https://app.codecov.io/gh/os-autoinst/os-autoinst/blob/master/consoles/sshXtermVt.pm
- AC2: The typo is gone, e.g. just everything removed :)
Suggestions
- We accept the hypothesis that the jobs failed due to the lower-level network issues in #108845, which has received a fix in the meantime, so there is nothing to do for the immediate root cause
- We can still improve the following:
  - Fix the typo in the message "does not seems"
  - Do not continue after the SSH connection fails
  - But be explicit about the root cause. The test finally aborts with "No route to host", so we should have access to that message. For example, in https://github.com/os-autoinst/os-autoinst/blob/master/consoles/sshXtermVt.pm#L60 make sure that the error details (the underlying error message in $! or $@) are used for a better error message
  - Make sure that we have unit test coverage with some mocking for this behaviour
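To illustrate the suggestions above, here is a minimal sketch of the intended behaviour. It is written in Python purely for illustration and is not the os-autoinst code (which is Perl). All names (`wait_for_ssh`, `SSHUnreachableError`, `demo_unreachable`) and the injectable `connect`, `sleep` and `clock` parameters are assumptions made for this example. The idea: stop instead of "continuing anyway" when no SSH server answers within the timeout, include the underlying error in the message, and keep the network call replaceable so a unit test can mock it.

```python
import errno
import socket
import time


class SSHUnreachableError(RuntimeError):
    """Hypothetical error: no SSH server answered before the deadline."""


def wait_for_ssh(host, port=22, timeout=120.0, interval=1.0,
                 connect=socket.create_connection,
                 sleep=time.sleep, clock=time.monotonic):
    """Return once a TCP connection to host:port succeeds, raise otherwise.

    connect, sleep and clock are injectable so that tests can mock them.
    """
    deadline = clock() + timeout
    last_error = None
    while True:
        try:
            conn = connect((host, port), timeout=interval)
            conn.close()
            return
        except OSError as exc:
            # Keep the underlying cause, e.g. "No route to host".
            last_error = exc
        if clock() >= deadline:
            break
        sleep(interval)
    # Fail here instead of continuing, and name the root cause.
    raise SSHUnreachableError(
        f"{host} does not seem to have an active SSH server "
        f"after {timeout:g}s: {last_error}"
    )


def demo_unreachable():
    """Mocked run: every connection attempt fails with EHOSTUNREACH."""
    now = [0.0]  # fake clock, advanced only by the fake sleep

    def refuse(address, timeout):
        raise OSError(errno.EHOSTUNREACH, "No route to host")

    def fake_sleep(seconds):
        now[0] += seconds

    try:
        wait_for_ssh("host.example", timeout=3, interval=1,
                     connect=refuse, sleep=fake_sleep, clock=lambda: now[0])
    except SSHUnreachableError as exc:
        return str(exc)
    return None
```

`demo_unreachable()` runs without any real network access or waiting, which is the kind of mocking the last suggestion asks for. In the Perl code the equivalent would presumably be to mock the SSH connection and assert on the resulting error message.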
Updated by okurz over 2 years ago
- Priority changed from Normal to High
- Target version set to Ready
Updated by JERiveraMoya over 2 years ago
- Related to action #98832: [qac][container][powerVM] rebootmgr fails in PowerVM reconnecting after reboot added
Updated by mkittler over 2 years ago
I've just tried to connect to redcurrant-4.qa.suse.de via SSH and it works, apart from the fact that I don't know the password. Not sure how we can help here. It looks just like a temporary networking issue, maybe even related to #108845.
Updated by okurz over 2 years ago
- Subject changed from Test died: Error connecting to <root@redcurrant-4.qa.suse.de>: No route to host to Test died: Error connecting to <root@redcurrant-4.qa.suse.de>: No route to host size:M
- Description updated (diff)
- Status changed from New to Workable
Updated by okurz over 2 years ago
- Subject changed from Test died: Error connecting to <root@redcurrant-4.qa.suse.de>: No route to host size:M to Improve os-autoinst sshXtermVt.pm connection error handling (was: "Test died: Error connecting to <root@redcurrant-4.qa.suse.de>: No route to host") size:M
- Priority changed from High to Low
Updated by livdywan over 2 years ago
- Has duplicate action #109719: [qe-core][ppc][hmc] Network issues on ppc64le-hmc workers : "Error connecting to redcurrant-3.qa.suse.de: no route to host" added
Updated by openqa_review over 2 years ago
This is an autogenerated message for openQA integration by the openqa_review script:
This bug is still referenced in a failing openQA test: RAID0@ppc64le-hmc-4disk
https://openqa.suse.de/tests/8570565#step/validate_md_raid/1
To prevent further reminder comments one of the following options should be followed:
- The test scenario is fixed by applying the bug fix to the tested product or the test is adjusted
- The openQA job group is moved to "Released" or "EOL" (End-of-Life)
- The bugref in the openQA scenario is removed or replaced, e.g.
label:wontfix:boo1234
Expect the next reminder at the earliest in 28 days if nothing changes in this ticket.
Updated by JERiveraMoya over 2 years ago
This keeps happening sporadically and seems unrelated to build, network, or power issues.
Updated by JERiveraMoya over 2 years ago
In the transactional scenario we have been seeing the same issue for a long time, and there it is not sporadic:
#98832
Updated by openqa_review over 2 years ago
This is an autogenerated message for openQA integration by the openqa_review script:
This bug is still referenced in a failing openQA test: transactional_server_helper_apps@ppc64le-hmc-single-disk
https://openqa.suse.de/tests/8743569#step/rebootmgr/1
Expect the next reminder at the earliest in 28 days if nothing changes in this ticket.
Updated by openqa_review over 2 years ago
This is an autogenerated message for openQA integration by the openqa_review script:
This bug is still referenced in a failing openQA test: transactional_server_helper_apps@ppc64le-hmc-single-disk
https://openqa.suse.de/tests/8752118#step/rebootmgr/1
Expect the next reminder at the earliest in 56 days if nothing changes in this ticket.
Updated by okurz over 2 years ago
- Target version changed from Ready to future
We in the SUSE QE Tools team currently do not have the capacity to work on this, so I am removing it from the backlog.
Updated by JERiveraMoya about 2 years ago
- Related to action #117127: Run rebootmgr in transactional_server_helper_apps in PowerVM only in YaST development group added