action #99123
closed
ssh based backends can run into timeout if ssh connection is stuck
Description
Observation
From https://suse.slack.com/archives/C02CANHLANP/p1632408138493900
There are also a lot of jobs failed on bootloader for PowerPC: https://openqa.suse.de/tests/7200264#step/bootloader_start/3
This job ran into the default openQA timeout of 2h. Excerpt from the log:
[2021-09-23T05:35:13.884 CEST] [debug] <<< backend::baseclass::run_ssh(cmd="! lssyscfg -m redcurrant -r lpar --filter 'lpar_ids=8' -F state | grep -i 'not activated' -q", password="SECRET", username="hscroot", wantarray=0, keep_open=0, hostname="powerhmc1.arch.suse.de")
[2021-09-23T05:35:13.885 CEST] [debug] <<< backend::baseclass::new_ssh_connection(wantarray=0, hostname="powerhmc1.arch.suse.de", keep_open=0, blocking=1, password="SECRET", username="hscroot")
XIO: fatal IO error 11 (Resource temporarily unavailable) on X server ":39057"
after 39145 requests (39145 known processed) with 0 events remaining.
[2021-09-23T07:28:47.185 CEST] [debug] backend got TERM
So something did not time out properly within 2h. Could it be the lssyscfg command?
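To put a number on the stall, the gap between the last ssh debug line and the TERM signal can be computed from the two timestamps in the excerpt. This is only a quick sketch; the helper name `gap` is made up for illustration and is not part of openQA or os-autoinst.

```python
from datetime import datetime

# Timestamp layout used in the log excerpt, without the " CEST" suffix
FMT = "%Y-%m-%dT%H:%M:%S.%f"


def gap(start, end):
    """Return the time elapsed between two log timestamps (hypothetical helper)."""
    return datetime.strptime(end, FMT) - datetime.strptime(start, FMT)


# Last new_ssh_connection debug line vs. "backend got TERM"
stalled_for = gap("2021-09-23T05:35:13.885", "2021-09-23T07:28:47.185")
print(stalled_for)  # 1:53:33.300000
```

So the backend logged nothing for almost 1h54m after opening the ssh connection, which fits the theory that the ssh call itself was blocking until the job timeout hit.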
Suggestions
I suggest improving the ssh command so that it is not stuck for 2h but times out after a reasonable time. That would be a start.
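The general idea of bounding a potentially stuck command could look roughly like the sketch below. It uses Python's `subprocess` purely as an illustration and is not the os-autoinst Perl implementation; the helper `run_with_timeout` and the sleeping child process standing in for a hanging remote command are assumptions made for this example.

```python
import subprocess
import sys


def run_with_timeout(argv, timeout_s):
    """Run argv and return its exit code, or None if it exceeds timeout_s seconds.

    Hypothetical helper: subprocess.run kills the child when the timeout
    expires and then raises TimeoutExpired, which is mapped to None here.
    """
    try:
        return subprocess.run(argv, timeout=timeout_s).returncode
    except subprocess.TimeoutExpired:
        return None


# Stand-in for a stuck remote command: a child process sleeping for 5 s
stuck = [sys.executable, "-c", "import time; time.sleep(5)"]
# Stand-in for a command that returns promptly
quick = [sys.executable, "-c", "pass"]

print(run_with_timeout(stuck, 0.5))
print(run_with_timeout(quick, 30))
```

With something like this in place the caller gets control back after the timeout and can fail the job with a clear message instead of waiting for the 2h job limit.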
Updated by okurz about 3 years ago
- Status changed from New to Feedback
https://github.com/os-autoinst/os-autoinst/pull/1780
Hoping that someone can provide proper testing before we merge this. If not then I would change …
EDIT: Disregard. I decided to update the change to apply no timeout by default but allow setting it with simple test parameters. This should allow safe testing in production on a case-by-case basis.
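The "no timeout by default, opt in via a test parameter" behaviour can be sketched as follows. The setting name `SSH_COMMAND_TIMEOUT_S` is the one used in the clone commands in this ticket; the function name and the plain dict standing in for the job settings are assumptions for illustration, not the actual os-autoinst code.

```python
def ssh_command_timeout(settings):
    """Return the ssh command timeout in seconds, or None for "no timeout".

    Hypothetical helper: `settings` is a plain dict standing in for the
    job's test variables.
    """
    raw = settings.get("SSH_COMMAND_TIMEOUT_S")
    if raw is None or raw == "":
        # Setting absent or empty: keep the old behaviour and never time out
        return None
    return float(raw)


print(ssh_command_timeout({}))
print(ssh_command_timeout({"SSH_COMMAND_TIMEOUT_S": "30"}))
```

Because an unset variable maps to "no timeout", existing jobs behave exactly as before, and only jobs cloned with the variable set exercise the new code path.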
Updated by okurz about 3 years ago
openqa-clone-job --within-instance https://openqa.suse.de/tests/7270694 BUILD= _GROUP=0 TEST=skip_registration-okurz_poo99123_ssh_timeout
openqa-clone-job --within-instance https://openqa.suse.de/tests/7270694 BUILD= _GROUP=0 TEST=skip_registration-okurz_poo99123_ssh_timeout_1s SSH_COMMAND_TIMEOUT_S=1
openqa-clone-job --within-instance https://openqa.suse.de/tests/7270694 BUILD= _GROUP=0 TEST=skip_registration-okurz_poo99123_ssh_timeout_30s SSH_COMMAND_TIMEOUT_S=30
openqa-clone-job --within-instance https://openqa.suse.de/tests/7270694 BUILD= _GROUP=0 TEST=skip_registration-okurz_poo99123_ssh_timeout_3600s SSH_COMMAND_TIMEOUT_S=3600
Created job #7274699: sle-15-SP4-Full-ppc64le-Build43.1-skip_registration@ppc64le-hmc-single-disk -> https://openqa.suse.de/t7282966
Created job #7274701: sle-15-SP4-Full-ppc64le-Build43.1-skip_registration@ppc64le-hmc-single-disk -> https://openqa.suse.de/t7282963
Created job #7274702: sle-15-SP4-Full-ppc64le-Build43.1-skip_registration@ppc64le-hmc-single-disk -> https://openqa.suse.de/t7282964
Created job #7274704: sle-15-SP4-Full-ppc64le-Build43.1-skip_registration@ppc64le-hmc-single-disk -> https://openqa.suse.de/t7282965
Updated by okurz about 3 years ago
- Due date changed from 2021-10-07 to 2021-10-14
As expected, the job without a timeout works fine, as do all jobs with a timeout greater than 1s.
https://github.com/os-autoinst/os-autoinst/pull/1806 suggests a sensible default timeout.
Updated by okurz almost 3 years ago
- Status changed from Feedback to Resolved
The change was recently deployed. If tests ran into the timeout, it would likely show as "Test died: Unable to establish SSH channel for serial console: Timed out waiting on socket". I checked around 50 new, unknown test failures from https://openqa.io.suse.de/openqa-review/openqa_suse_de_status.html but found no related failures. So I am resolving this ticket.