action #99123

closed

ssh based backends can run into timeout if ssh connection is stuck

Added by okurz over 2 years ago. Updated over 2 years ago.

Status: Resolved
Priority: Low
Assignee:
Category: Feature requests
Target version:
Start date: 2021-09-23
Due date: 2021-10-14
% Done: 0%
Estimated time:
Description

Observation

From https://suse.slack.com/archives/C02CANHLANP/p1632408138493900

There are also a lot of jobs that failed in bootloader for PowerPC: https://openqa.suse.de/tests/7200264#step/bootloader_start/3

This job ran into the default openQA timeout of 2h. Excerpt from the log:

[2021-09-23T05:35:13.884 CEST] [debug] <<< backend::baseclass::run_ssh(cmd="! lssyscfg -m redcurrant -r lpar --filter 'lpar_ids=8' -F state | grep -i 'not activated' -q", password="SECRET", username="hscroot", wantarray=0, keep_open=0, hostname="powerhmc1.arch.suse.de")
[2021-09-23T05:35:13.885 CEST] [debug] <<< backend::baseclass::new_ssh_connection(wantarray=0, hostname="powerhmc1.arch.suse.de", keep_open=0, blocking=1, password="SECRET", username="hscroot")
XIO:  fatal IO error 11 (Resource temporarily unavailable) on X server ":39057"
      after 39145 requests (39145 known processed) with 0 events remaining.
[2021-09-23T07:28:47.185 CEST] [debug] backend got TERM

So something did not time out properly within 2h. Could it be the lssyscfg command?

Suggestions

I suggest improving the SSH command handling so that it does not stay stuck for 2h but times out after a reasonable time. That would be a start.
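To illustrate the idea of bounding a potentially stuck command, here is a minimal shell sketch. It is not the change proposed for os-autoinst; it only shows the general technique using `timeout` from GNU coreutils, which exits with status 124 when the limit is reached. The helper name `run_bounded` is hypothetical, and `sleep 30` stands in for an SSH command that never returns.

```shell
#!/bin/bash
# Hypothetical helper: run a command but give up after the given number of seconds.
# Relies on GNU coreutils `timeout`, which returns 124 if the limit was hit.
run_bounded() {
    local limit_s="$1"
    shift
    timeout "$limit_s" "$@"
}

# Stand-in for a stuck SSH command: sleeps far longer than the 1s limit.
status=0
run_bounded 1 sleep 30 || status=$?
echo "exit status: $status"   # prints "exit status: 124"
```

With a bound like this, a hanging connection fails after seconds or minutes instead of consuming the whole 2h job timeout.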

Actions #1

Updated by okurz over 2 years ago

  • Status changed from New to Feedback

https://github.com/os-autoinst/os-autoinst/pull/1780

Hoping that someone can provide proper testing before we merge this. If not then I would change …

EDIT: Disregard. I decided to update the change to apply no timeout by default but allow setting one with simple test parameters. This should allow safe testing in production on a case-by-case basis.
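The "no timeout by default, opt in per test" behavior can be sketched in shell as follows. This is an assumption-laden illustration, not the os-autoinst implementation: it reads `SSH_COMMAND_TIMEOUT_S` from the environment, whereas the real setting is an openQA test parameter, and the wrapper name `run_ssh_command` is hypothetical.

```shell
#!/bin/bash
# Hypothetical wrapper: apply a timeout only if SSH_COMMAND_TIMEOUT_S is set
# and non-empty; otherwise run the command unbounded, as before.
run_ssh_command() {
    if [ -n "${SSH_COMMAND_TIMEOUT_S:-}" ]; then
        timeout "$SSH_COMMAND_TIMEOUT_S" "$@"
    else
        "$@"
    fi
}

# Default: no timeout, the command simply runs to completion.
unset SSH_COMMAND_TIMEOUT_S
run_ssh_command true && echo "default: finished without timeout"

# Opt-in: a 1s limit aborts a command that takes longer.
SSH_COMMAND_TIMEOUT_S=1
status=0
run_ssh_command sleep 5 || status=$?
echo "with 1s limit: exit status $status"
```

Keeping the default unchanged means existing jobs behave exactly as before, while individual jobs can try out a limit.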

Actions #2

Updated by okurz over 2 years ago

openqa-clone-job --within-instance https://openqa.suse.de/tests/7270694 BUILD= _GROUP=0 TEST=skip_registration-okurz_poo99123_ssh_timeout
openqa-clone-job --within-instance https://openqa.suse.de/tests/7270694 BUILD= _GROUP=0 TEST=skip_registration-okurz_poo99123_ssh_timeout_1s SSH_COMMAND_TIMEOUT_S=1
openqa-clone-job --within-instance https://openqa.suse.de/tests/7270694 BUILD= _GROUP=0 TEST=skip_registration-okurz_poo99123_ssh_timeout_30s SSH_COMMAND_TIMEOUT_S=30
openqa-clone-job --within-instance https://openqa.suse.de/tests/7270694 BUILD= _GROUP=0 TEST=skip_registration-okurz_poo99123_ssh_timeout_3600s SSH_COMMAND_TIMEOUT_S=3600

Created job #7274699: sle-15-SP4-Full-ppc64le-Build43.1-skip_registration@ppc64le-hmc-single-disk -> https://openqa.suse.de/t7282966
Created job #7274701: sle-15-SP4-Full-ppc64le-Build43.1-skip_registration@ppc64le-hmc-single-disk -> https://openqa.suse.de/t7282963
Created job #7274702: sle-15-SP4-Full-ppc64le-Build43.1-skip_registration@ppc64le-hmc-single-disk -> https://openqa.suse.de/t7282964
Created job #7274704: sle-15-SP4-Full-ppc64le-Build43.1-skip_registration@ppc64le-hmc-single-disk -> https://openqa.suse.de/t7282965
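The four clone commands above differ only in the test name suffix and the timeout value, so they could also be generated in a loop. The following is a dry-run sketch that only prints the commands instead of executing them; the function name `print_clone_commands` is hypothetical.

```shell
#!/bin/bash
base_job="https://openqa.suse.de/tests/7270694"
test_prefix="skip_registration-okurz_poo99123_ssh_timeout"

# Hypothetical helper: print one baseline clone command without the variable,
# then one clone command per timeout value given as argument.
print_clone_commands() {
    local t
    echo "openqa-clone-job --within-instance $base_job BUILD= _GROUP=0 TEST=$test_prefix"
    for t in "$@"; do
        echo "openqa-clone-job --within-instance $base_job BUILD= _GROUP=0 TEST=${test_prefix}_${t}s SSH_COMMAND_TIMEOUT_S=$t"
    done
}

print_clone_commands 1 30 3600
```

Piping the output to `sh` would actually trigger the jobs, which requires `openqa-clone-job` and access to the openQA instance.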

Actions #3

Updated by okurz over 2 years ago

  • Due date changed from 2021-10-07 to 2021-10-14

As expected, the job without a timeout works fine, as do all jobs with a timeout greater than 1s.

https://github.com/os-autoinst/os-autoinst/pull/1806 to suggest a sensible default timeout.

Actions #4

Updated by okurz over 2 years ago

Merged, awaiting feedback from deployment.

Actions #5

Updated by okurz over 2 years ago

  • Status changed from Feedback to Resolved

Was recently deployed. If tests ran into the timeout, it would likely show up as "Test died: Unable to establish SSH channel for serial console: Timed out waiting on socket". I checked new, unknown test failures from https://openqa.io.suse.de/openqa-review/openqa_suse_de_status.html, around 50 jobs, but found no related failures. So I am resolving this ticket.
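A check like the one described could be partly automated by matching failure reasons against the expected message. The sketch below assumes the failure reason is available as a plain string; the function name `is_ssh_timeout_failure` is hypothetical, and how the reasons are obtained from openQA is left out.

```shell
#!/bin/bash
# Message expected when the new SSH timeout is hit, as quoted in this ticket.
ssh_timeout_marker='Unable to establish SSH channel for serial console: Timed out waiting on socket'

# Hypothetical helper: succeed if the given failure reason contains the marker.
# grep -F matches the marker as a fixed string, not as a regular expression.
is_ssh_timeout_failure() {
    printf '%s\n' "$1" | grep -qF -- "$ssh_timeout_marker"
}

if is_ssh_timeout_failure "Test died: $ssh_timeout_marker"; then
    echo "related to SSH timeout"
fi
```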
