action #99111
closed coordination #109668: [saga][epic] Stable and updated non-qemu backends for SLE validation
coordination #99030: [epic] openQA bare-metal test dies due to lost SSH connection auto_review:"backend died: Lost SSH connection to SUT: Failure while draining incoming flow":retry
Confirm or disprove that openQA bare-metal test loses SSH connection due to package updates size:M
Description
Motivation
See http://duck-norris.qam.suse.de/tests/7274 (and the cloned jobs within). Other jobs are affected as well, e.g. https://openqa.suse.de/tests/7171999 and http://openqa.qam.suse.cz/tests/28072
Acceptance criteria
- AC1: Epic is updated to clarify if package updates contributed to the problem
Suggestion
- Roll back package updates, i.e. to older rpms that we have stored in local caches, or use btrfs snapshots of the root fs (see the sketch after this list)
- Trigger tests after rollback and verify pass rate
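A minimal sketch of the snapshot-based variant, assuming the worker host uses snapper-managed btrfs snapshots of the root filesystem (the snapshot number and job id below are placeholders):
# pick a root-fs snapshot from before the problem appeared (around 2021-09-21)
sudo snapper list
# make that snapshot the new default subvolume and boot into it
sudo snapper rollback <snapshot-number>
sudo reboot
# afterwards re-trigger one of the affected jobs and compare the pass rate
openqa-cli api -X POST jobs/<job-id>/restart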
Out of scope
- Debugging on any host outside O3 or OSD
(was: Check with @ph03nix for login to duck-norris.qam.suse.de if needed)
Updated by mkittler about 3 years ago
- I wouldn't know how to log in to duck-norris.qam.suse.de - apparently the machine isn't using our normal salt config. Even if I could log in, I'm not sure whether it makes sense to play around on a host which is actually administered by somebody else.
- For gathering statistics it would be good to know what kind of jobs we're talking about. Any kind of job with BACKEND=ipmi or BACKEND=svirt? One obviously needs some search criteria to find relevant jobs (e.g. via an SQL query) to get the numbers for the comparison and to re-trigger the jobs.
- Judging by the creation date of the epic and jobs, the problem was noticed around 2021-09-21. So we would likely need to boot into a snapshot from a few weeks before that.
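A hedged sketch of such a query, assuming direct psql access to the openQA database on OSD (the database name, user and cut-off date below are assumptions; the column names follow the current openQA schema):
sudo -u geekotest psql openqa -c "
  select id, t_finished, test, reason
    from jobs
   where reason like '%Lost SSH connection to SUT: Failure while draining incoming flow%'
     and t_finished >= '2021-08-01'
   order by t_finished desc;"
The resulting job ids could then be re-triggered, e.g. with openqa-cli api -X POST jobs/<id>/restart.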
Updated by livdywan about 3 years ago
mkittler wrote:
- I wouldn't know how to log in to duck-norris.qam.suse.de - apparently the machine isn't using our normal salt config. Even if I could log in, I'm not sure whether it makes sense to play around on a host which is actually administered by somebody else.
I suggest you check with @ph03nix - that was obvious to me so I failed to add it explicitly, sorry about that.
Updated by okurz about 3 years ago
- Description updated (diff)
- Category set to Regressions/Crashes
Please consider debugging on any host other than the ones we maintain within o3 or osd as out of scope. I updated the description accordingly.
See https://progress.opensuse.org/issues/99030#Steps-to-reproduce to find tests with the same symptom. By now it seems it's actually pretty hard to reproduce, and obviously waiting longer and longer makes it even harder to investigate. I suggest that anyone from the team simply pick this up and clarify with @ph03nix how this can be reproduced nowadays, what to do, where to conduct the "package rollback experiment", etc.
Updated by tbaev about 3 years ago
I have a way to reproduce the failure; this is from yesterday, when I ran it on an up-to-date system. I can restart the test and it will fail again. To get access you can reach out to me on Slack.
http://d453.qam.suse.de/tests/624
openqa-clone-job \
--from openqa.qam.suse.cz \
-v 30582 \
--host localhost \
--clone-children \
--parental-inheritance \
WORKER_CLASS=horror \
BACKEND=ipmi \
INCIDENT_ID= \
INCIDENT_REPO= \
CASEDIR=https://github.com/tbaev/os-autoinst-distri-opensuse.git#parallel_guest_install
Updated by okurz about 3 years ago
Old packages are kept in the path /var/cache/zypp/packages/
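A minimal sketch of a rollback based on that cache, assuming keeppackages=1 was enabled on the relevant repositories so the pre-update rpms are still present (repository alias, package name and version below are placeholders):
# check which older versions are still cached
find /var/cache/zypp/packages/ -name '*.rpm' | sort
# force-downgrade a suspect package to its cached pre-update version
sudo rpm -Uvh --oldpackage /var/cache/zypp/packages/<repo-alias>/<arch>/<package>-<old-version>.rpm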
Updated by okurz about 3 years ago
- Status changed from Workable to Rejected
- Assignee changed from mkittler to okurz
Unfortunately it's no longer feasible to simply try a rollback.
I used openqa-query-for-job-label poo#99030 and found no matches. I created https://github.com/os-autoinst/scripts/pull/119 and with that I could run a longer query going back 90 days:
interval='90 day' openqa-query-for-job-label poo#99030
7221423|2021-09-25 00:49:56|done|incomplete|sle-micro_containers|backend died: Lost SSH connection to SUT: Failure while draining incoming flow at /usr/lib/os-autoinst/consoles/ssh_screen.pm line 89.|grenache-1
7220091|2021-09-24 21:11:37|done|incomplete|ltp_cve_git|backend died: Lost SSH connection to SUT: Failure while draining incoming flow at /usr/lib/os-autoinst/consoles/ssh_screen.pm line 89.|grenache-1
7224329|2021-09-24 16:16:11|done|failed|qam-sles4sap_online_dvd_gnome_hana_nvdimm||grenache-1
7207682|2021-09-23 21:51:28|done|incomplete|engines_and_tools_podman|backend died: Lost SSH connection to SUT: Failure while draining incoming flow at /usr/lib/os-autoinst/consoles/ssh_screen.pm line 89.|grenache-1
7172003|2021-09-20 09:17:44|done|incomplete|ltp_syscalls_debug_pagealloc|backend died: Lost SSH connection to SUT: Failure while draining incoming flow at /usr/lib/os-autoinst/consoles/ssh_screen.pm line 89.|grenache-1
7171999|2021-09-20 09:13:19|done|incomplete|kernel-live-patching|backend died: Lost SSH connection to SUT: Failure while draining incoming flow at /usr/lib/os-autoinst/consoles/ssh_screen.pm line 89.|grenache-1
As the current ticket applies to changes on OSD infrastructure only and we don't see the problem there anymore, we should reject it and continue with other tasks in the parent epic.