action #99111
closed
coordination #109668: [saga][epic] Stable and updated non-qemu backends for SLE validation
coordination #99030: [epic] openQA bare-metal test dies due to lost SSH connection auto_review:"backend died: Lost SSH connection to SUT: Failure while draining incoming flow":retry
Confirm or disprove that openQA bare-metal test loses SSH connection due to package updates size:M
Added by livdywan about 3 years ago.
Updated almost 3 years ago.
Category:
Regressions/Crashes
- I wouldn't know how to log in to duck-norris.qam.suse.de - apparently the machine isn't using our normal salt config. Even if I could log in, I'm not sure whether it makes sense to play around on a host which is actually administered by somebody else.
- For gathering statistics it would be good to know what kind of jobs we're talking about. Any kind of job with BACKEND=ipmi or BACKEND=svirt? One obviously needs some search criteria to find relevant jobs (e.g. via an SQL query) to get the numbers for the comparison and to re-trigger the jobs.
- Judging by the creation date of the epic and jobs the problem was noticed around 2021-09-21. So we would likely need to boot into a snapshot a few weeks before that.
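As a rough illustration of the kind of search criteria meant above, here is a minimal sketch of a per-backend count of jobs showing the symptom. It runs against a toy in-memory SQLite database; the table and column names (jobs, job_settings, reason, key, value) are assumptions modelled loosely on openQA and the rows are made up, so the query would need adapting before running it against the real PostgreSQL database.

```python
import sqlite3

# Hypothetical, simplified schema. The real openQA database is PostgreSQL
# and its tables have more columns; names here are assumptions.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE jobs (id INTEGER PRIMARY KEY, result TEXT, reason TEXT, t_finished TEXT);
CREATE TABLE job_settings (job_id INTEGER, key TEXT, value TEXT);
""")

# Made-up rows, only there to exercise the query.
conn.executemany("INSERT INTO jobs VALUES (?, ?, ?, ?)", [
    (1, "incomplete", "backend died: Lost SSH connection to SUT", "2021-09-24"),
    (2, "passed", None, "2021-09-24"),
    (3, "incomplete", "backend died: Lost SSH connection to SUT", "2021-09-25"),
    (4, "passed", None, "2021-09-25"),
])
conn.executemany("INSERT INTO job_settings VALUES (?, ?, ?)", [
    (1, "BACKEND", "ipmi"), (2, "BACKEND", "svirt"),
    (3, "BACKEND", "qemu"), (4, "BACKEND", "ipmi"),
])

# Per backend: total number of jobs and how many died with the symptom.
QUERY = """
SELECT s.value AS backend,
       COUNT(*) AS total,
       SUM(CASE WHEN j.reason LIKE '%Lost SSH connection to SUT%' THEN 1 ELSE 0 END) AS lost_ssh
FROM jobs j
JOIN job_settings s ON s.job_id = j.id AND s.key = 'BACKEND'
WHERE s.value IN ('ipmi', 'svirt')
GROUP BY s.value
ORDER BY s.value
"""
stats = conn.execute(QUERY).fetchall()
for backend, total, lost_ssh in stats:
    print(f"{backend}: {lost_ssh} of {total} jobs lost the SSH connection")
```

Comparing these ratios for a time window before and after a package update would give the numbers for the comparison; the job IDs behind them are the candidates to re-trigger.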
mkittler wrote:
- I wouldn't know how to log in to duck-norris.qam.suse.de - apparently the machine isn't using our normal salt config. Even if I could log in, I'm not sure whether it makes sense to play around on a host which is actually administered by somebody else.
I suggest you check with @ph03nix - that was obvious to me so I failed to add it explicitly, sorry about that.
- Description updated (diff)
- Description updated (diff)
- Category set to Regressions/Crashes
Please consider debugging on any other host than the one we maintain within o3 or osd as out of scope. I updated the description accordingly.
See https://progress.opensuse.org/issues/99030#Steps-to-reproduce to find tests with the same symptom. By now it seems it's actually pretty hard to reproduce. Obviously waiting longer and longer makes it even harder to investigate. I suggest for anyone from the team to simply pick this up and clarify with @ph03nix about how this can be reproduced nowadays, what to do, where to conduct the "package rollback experiment", etc.
I have a way to reproduce the failure. The run below is from yesterday, on an up-to-date system. I can restart the test and it will fail again. To get access, reach out to me on Slack.
http://d453.qam.suse.de/tests/624
openqa-clone-job \
--from openqa.qam.suse.cz \
-v 30582 \
--host localhost \
--clone-children \
--parental-inheritance \
WORKER_CLASS=horror \
BACKEND=ipmi \
INCIDENT_ID= \
INCIDENT_REPO= \
CASEDIR=https://github.com/tbaev/os-autoinst-distri-opensuse.git#parallel_guest_install
Old packages are kept in the path /var/cache/zypp/packages/
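For picking rollback candidates from that cache, a minimal sketch that lists cached RPMs last modified before a cutoff date could look like this. The helper name cached_rpms_before is hypothetical, and it assumes the file modification time roughly reflects when the package was downloaded, which may not hold on every system.

```python
from datetime import datetime, timezone
from pathlib import Path


def cached_rpms_before(cache_dir, cutoff):
    """Return (mtime, path) pairs for *.rpm files modified before cutoff, oldest first."""
    found = []
    for rpm in Path(cache_dir).rglob("*.rpm"):
        mtime = datetime.fromtimestamp(rpm.stat().st_mtime, tz=timezone.utc)
        if mtime < cutoff:
            found.append((mtime, rpm))
    return sorted(found)


if __name__ == "__main__":
    # 2021-09-21 is the date around which the problem was first noticed.
    cutoff = datetime(2021, 9, 21, tzinfo=timezone.utc)
    for mtime, rpm in cached_rpms_before("/var/cache/zypp/packages/", cutoff):
        print(mtime.date(), rpm)
```

Anything listed would still need to be checked against what was actually installed before attempting a downgrade.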
- Status changed from Workable to Rejected
- Assignee changed from mkittler to okurz
By now unfortunately it's not feasible anymore to simply try a rollback.
I used openqa-query-for-job-label poo#99030 and found no matches.
I created https://github.com/os-autoinst/scripts/pull/119 and with that I could run a longer query going back 90 days with interval='90 day' openqa-query-for-job-label poo#99030:
7221423|2021-09-25 00:49:56|done|incomplete|sle-micro_containers|backend died: Lost SSH connection to SUT: Failure while draining incoming flow at /usr/lib/os-autoinst/consoles/ssh_screen.pm line 89.|grenache-1
7220091|2021-09-24 21:11:37|done|incomplete|ltp_cve_git|backend died: Lost SSH connection to SUT: Failure while draining incoming flow at /usr/lib/os-autoinst/consoles/ssh_screen.pm line 89.|grenache-1
7224329|2021-09-24 16:16:11|done|failed|qam-sles4sap_online_dvd_gnome_hana_nvdimm||grenache-1
7207682|2021-09-23 21:51:28|done|incomplete|engines_and_tools_podman|backend died: Lost SSH connection to SUT: Failure while draining incoming flow at /usr/lib/os-autoinst/consoles/ssh_screen.pm line 89.|grenache-1
7172003|2021-09-20 09:17:44|done|incomplete|ltp_syscalls_debug_pagealloc|backend died: Lost SSH connection to SUT: Failure while draining incoming flow at /usr/lib/os-autoinst/consoles/ssh_screen.pm line 89.|grenache-1
7171999|2021-09-20 09:13:19|done|incomplete|kernel-live-patching|backend died: Lost SSH connection to SUT: Failure while draining incoming flow at /usr/lib/os-autoinst/consoles/ssh_screen.pm line 89.|grenache-1
As the current ticket applies to changes on OSD infrastructure only and we don't see the problem there anymore, we should reject it and continue with other tasks in the parent epic.
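To make the 90-day query output above easier to read, here is a minimal sketch that tallies the jobs showing the symptom per day. The column names in FIELDS are assumptions, since the output carries no header; the rows are copied verbatim from the listing.

```python
from collections import Counter

# Assumed column names; the query output itself has no header line.
FIELDS = ("job_id", "finished", "state", "result", "test", "reason", "host")
SYMPTOM = "Lost SSH connection to SUT"

OUTPUT = """\
7221423|2021-09-25 00:49:56|done|incomplete|sle-micro_containers|backend died: Lost SSH connection to SUT: Failure while draining incoming flow at /usr/lib/os-autoinst/consoles/ssh_screen.pm line 89.|grenache-1
7220091|2021-09-24 21:11:37|done|incomplete|ltp_cve_git|backend died: Lost SSH connection to SUT: Failure while draining incoming flow at /usr/lib/os-autoinst/consoles/ssh_screen.pm line 89.|grenache-1
7224329|2021-09-24 16:16:11|done|failed|qam-sles4sap_online_dvd_gnome_hana_nvdimm||grenache-1
7207682|2021-09-23 21:51:28|done|incomplete|engines_and_tools_podman|backend died: Lost SSH connection to SUT: Failure while draining incoming flow at /usr/lib/os-autoinst/consoles/ssh_screen.pm line 89.|grenache-1
7172003|2021-09-20 09:17:44|done|incomplete|ltp_syscalls_debug_pagealloc|backend died: Lost SSH connection to SUT: Failure while draining incoming flow at /usr/lib/os-autoinst/consoles/ssh_screen.pm line 89.|grenache-1
7171999|2021-09-20 09:13:19|done|incomplete|kernel-live-patching|backend died: Lost SSH connection to SUT: Failure while draining incoming flow at /usr/lib/os-autoinst/consoles/ssh_screen.pm line 89.|grenache-1
"""


def parse_rows(text):
    """Split each non-empty pipe-separated line into a dict keyed by FIELDS."""
    return [dict(zip(FIELDS, line.split("|")))
            for line in text.splitlines() if line.strip()]


def symptom_per_day(rows):
    """Count rows whose reason contains SYMPTOM, grouped by finish date."""
    return Counter(r["finished"].split()[0] for r in rows if SYMPTOM in r["reason"])


rows = parse_rows(OUTPUT)
per_day = symptom_per_day(rows)
for day in sorted(per_day):
    print(day, per_day[day])
```

All six labelled jobs ran on grenache-1 between 2021-09-20 and 2021-09-25; five of them carry the "Lost SSH connection" reason and one is a plain failure with an empty reason.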