action #99111

coordination #99030: [epic] openQA bare-metal test dies due to lost SSH connection auto_review:"backend died: Lost SSH connection to SUT: Failure while draining incoming flow":retry

Confirm or disprove that openQA bare-metal test loses SSH connection due to package updates size:M

Added by cdywan 2 months ago. Updated 20 days ago.

Status:
Rejected
Priority:
High
Assignee:
Category:
Concrete Bugs
Target version:
Start date:
2021-09-23
Due date:
% Done:

0%

Estimated time:
Difficulty:

Description

Motivation

See http://duck-norris.qam.suse.de/tests/7274 (and the cloned jobs within). Also other jobs are affected e.g. https://openqa.suse.de/tests/7171999, http://openqa.qam.suse.cz/tests/28072

Acceptance criteria

  • AC1: Epic is updated to clarify if package updates contributed to the problem

Suggestion

  • Roll back package updates, i.e. use older rpms that we have stored in local caches, or btrfs snapshots of the root fs.
  • Trigger tests after rollback and verify pass rate
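
The rollback step above could look like the following sketch, assuming the worker's root filesystem is btrfs managed by snapper (the snapshot number 42 is a placeholder). The commands are only printed, not executed, since the right snapshot has to be picked by hand:

```shell
# Print the rollback plan; assumes btrfs + snapper on the worker's root fs.
# Snapshot number 42 is a placeholder for one created before 2021-09-21.
cat <<'EOF'
snapper --iso list        # find a root-fs snapshot from before 2021-09-21
snapper rollback 42       # roll the default subvolume back to it
systemctl reboot          # boot into the rolled-back root fs
EOF
```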

Out of scope

  • Debugging on any host outside O3 or OSD (was: Check with ph03nix for login to duck-norris.qam.suse.de if needed)

History

#1 Updated by mkittler about 1 month ago

  • I wouldn't know how to log in to duck-norris.qam.suse.de - apparently the machine isn't using our normal salt config. Even if I could log in, I'm not sure whether it makes sense to play around on a host which is actually administered by somebody else.
  • For gathering statistics it would be good to know what kind of jobs we're talking about. Any kind of job with BACKEND=ipmi or BACKEND=svirt? One obviously needs some search criteria to find relevant jobs (e.g. via an SQL query) to get the numbers for the comparison and to re-trigger the jobs.
  • Judging by the creation date of the epic and jobs the problem was noticed around 2021-09-21. So we would likely need to boot into a snapshot a few weeks before that.
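
The statistics query suggested above could look like this sketch. The table and column names (jobs.result, jobs.reason, jobs.t_finished) follow the openQA PostgreSQL schema; the database name and how one connects are assumptions and would need adjusting per host. The query is only echoed here:

```shell
# Hypothetical SQL to count affected jobs per week; run it against the
# openQA PostgreSQL database, e.g.: psql openqa -c "$QUERY"
QUERY="SELECT date_trunc('week', t_finished) AS week, count(*)
         FROM jobs
        WHERE result = 'incomplete'
          AND reason LIKE '%Lost SSH connection to SUT%'
        GROUP BY week
        ORDER BY week;"
echo "$QUERY"
```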

#2 Updated by cdywan about 1 month ago

mkittler wrote:

  • I wouldn't know how to log in to duck-norris.qam.suse.de - apparently the machine isn't using our normal salt config. Even if I could log in, I'm not sure whether it makes sense to play around on a host which is actually administered by somebody else.

I suggest you check with ph03nix - that was obvious to me so I failed to add it explicitly, sorry about that.

#3 Updated by cdywan about 1 month ago

  • Description updated (diff)

#4 Updated by okurz about 1 month ago

  • Description updated (diff)
  • Category set to Concrete Bugs

Please consider debugging on any host other than the ones we maintain within o3 or osd as out of scope. I updated the description accordingly.

See
https://progress.opensuse.org/issues/99030#Steps-to-reproduce to find tests with the same symptom. By now it seems the problem is actually pretty hard to reproduce, and the longer we wait the harder it gets to investigate. I suggest that anyone from the team simply pick this up and clarify with ph03nix how this can be reproduced nowadays, what to do, where to conduct the "package rollback experiment", etc.

#5 Updated by mkittler 27 days ago

  • Assignee set to mkittler

#6 Updated by tbaev 25 days ago

I have a way to reproduce the test. The run below is from yesterday on an up-to-date system; I can restart the test and it will fail again. For access, reach out to me on Slack.

http://d453.qam.suse.de/tests/624

openqa-clone-job \
--from openqa.qam.suse.cz \
-v 30582 \
--host localhost \
--clone-children \
--parental-inheritance \
WORKER_CLASS=horror \
BACKEND=ipmi \
INCIDENT_ID= \
INCIDENT_REPO= \
CASEDIR=https://github.com/tbaev/os-autoinst-distri-opensuse.git#parallel_guest_install

#7 Updated by okurz 21 days ago

Old packages are kept in the path /var/cache/zypp/packages/
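
A sketch of how those cached rpms could be located for a downgrade; the cutoff date comes from comment #1, it assumes the repos had keep-packages enabled so old rpms are still cached, and the package path in the zypper call is a placeholder:

```shell
# List cached rpms last modified before the suspected regression date;
# these are downgrade candidates for "zypper install --oldpackage".
# Assumes keep-packages was enabled so old rpms remain in the cache.
CACHE=${CACHE:-/var/cache/zypp/packages}
if [ -d "$CACHE" ]; then
    find "$CACHE" -name '*.rpm' ! -newermt 2021-09-21 -print
fi
# then, per package (path is a placeholder):
# sudo zypper install --oldpackage "$CACHE/<repo>/x86_64/<name-version>.rpm"
```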

#8 Updated by okurz 20 days ago

  • Status changed from Workable to Rejected
  • Assignee changed from mkittler to okurz

By now it is unfortunately no longer feasible to simply try a rollback.

I used openqa-query-for-job-label poo#99030 and found no matches

I created https://github.com/os-autoinst/scripts/pull/119, and with that I could run a longer query going back 90 days: interval='90 day' openqa-query-for-job-label poo#99030:

7221423|2021-09-25 00:49:56|done|incomplete|sle-micro_containers|backend died: Lost SSH connection to SUT: Failure while draining incoming flow at /usr/lib/os-autoinst/consoles/ssh_screen.pm line 89.|grenache-1
7220091|2021-09-24 21:11:37|done|incomplete|ltp_cve_git|backend died: Lost SSH connection to SUT: Failure while draining incoming flow at /usr/lib/os-autoinst/consoles/ssh_screen.pm line 89.|grenache-1
7224329|2021-09-24 16:16:11|done|failed|qam-sles4sap_online_dvd_gnome_hana_nvdimm||grenache-1
7207682|2021-09-23 21:51:28|done|incomplete|engines_and_tools_podman|backend died: Lost SSH connection to SUT: Failure while draining incoming flow at /usr/lib/os-autoinst/consoles/ssh_screen.pm line 89.|grenache-1
7172003|2021-09-20 09:17:44|done|incomplete|ltp_syscalls_debug_pagealloc|backend died: Lost SSH connection to SUT: Failure while draining incoming flow at /usr/lib/os-autoinst/consoles/ssh_screen.pm line 89.|grenache-1
7171999|2021-09-20 09:13:19|done|incomplete|kernel-live-patching|backend died: Lost SSH connection to SUT: Failure while draining incoming flow at /usr/lib/os-autoinst/consoles/ssh_screen.pm line 89.|grenache-1

As the current ticket applies to changes on the OSD infrastructure only and we no longer see the problem there, we should reject this ticket and continue with other tasks in the parent epic.
