action #99111

coordination #99030: [epic] openQA bare-metal test dies due to lost SSH connection auto_review:"backend died: Lost SSH connection to SUT: Failure while draining incoming flow":retry

Confirm or disprove that openQA bare-metal test loses SSH connection due to package updates size:M

Added by cdywan 2 months ago. Updated 20 days ago.

Status:
Rejected
Priority:
High
Assignee:
Category:
Concrete Bugs
Target version:
Start date:
2021-09-23
Due date:
% Done:

0%

Estimated time:
Difficulty:

Description

Motivation

See http://duck-norris.qam.suse.de/tests/7274 (and the cloned jobs within). Also other jobs are affected e.g. https://openqa.suse.de/tests/7171999, http://openqa.qam.suse.cz/tests/28072

Acceptance criteria

  • AC1: Epic is updated to clarify if package updates contributed to the problem

Suggestion

  • Roll back package updates, i.e. use older rpms that we have stored in local caches, or btrfs snapshots of the root fs.
  • Trigger tests after rollback and verify pass rate
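
The rollback step above could look like the following sketch, assuming the worker's root filesystem is btrfs managed by snapper (the snapshot number 42 is a placeholder). The commands are only printed, not executed, since the right snapshot has to be picked by hand:

```shell
# Print the rollback plan; assumes btrfs + snapper on the worker's root fs.
# Snapshot number 42 is a placeholder for one created before 2021-09-21.
cat <<'EOF'
snapper --iso list        # find a root-fs snapshot from before 2021-09-21
snapper rollback 42       # roll the default subvolume back to it
systemctl reboot          # boot into the rolled-back root fs
EOF
```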

Out of scope

  • Debugging on any host outside O3 or OSD (was: Check with ph03nix for login to duck-norris.qam.suse.de if needed)

History

#1 Updated by mkittler about 1 month ago

  • I wouldn't know how to log in to duck-norris.qam.suse.de - apparently the machine isn't using our normal salt config. Even if I could log in, I'm not sure whether it makes sense to play around on a host which is actually administered by somebody else.
  • For gathering statistics it would be good to know what kind of jobs we're talking about. Any kind of job with BACKEND=ipmi or BACKEND=svirt? One obviously needs some search criteria to find relevant jobs (e.g. via an SQL query) to get the numbers for the comparison and to re-trigger the jobs.
  • Judging by the creation date of the epic and jobs the problem was noticed around 2021-09-21. So we would likely need to boot into a snapshot a few weeks before that.
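
The statistics query suggested above could look like this sketch. The table and column names (jobs.result, jobs.reason, jobs.t_finished) follow the openQA PostgreSQL schema; the database name and how one connects are assumptions and would need adjusting per host. The query is only echoed here:

```shell
# Hypothetical SQL to count affected jobs per week; run it against the
# openQA PostgreSQL database, e.g.: psql openqa -c "$QUERY"
QUERY="SELECT date_trunc('week', t_finished) AS week, count(*)
         FROM jobs
        WHERE result = 'incomplete'
          AND reason LIKE '%Lost SSH connection to SUT%'
        GROUP BY week
        ORDER BY week;"
echo "$QUERY"
```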

#2 Updated by cdywan about 1 month ago

mkittler wrote:

  • I wouldn't know how to log in to duck-norris.qam.suse.de - apparently the machine isn't using our normal salt config. Even if I could log in, I'm not sure whether it makes sense to play around on a host which is actually administered by somebody else.

I suggest you check with ph03nix - that was obvious to me so I failed to add it explicitly, sorry about that.

#3 Updated by cdywan about 1 month ago

  • Description updated (diff)

#4 Updated by okurz about 1 month ago

  • Description updated (diff)
  • Category set to Concrete Bugs

Please consider debugging on any host other than the ones we maintain within o3 or osd as out of scope. I updated the description accordingly.

See
https://progress.opensuse.org/issues/99030#Steps-to-reproduce to find tests with the same symptom. By now it seems the problem is actually pretty hard to reproduce, and the longer we wait the harder it gets to investigate. I suggest that anyone from the team simply pick this up and clarify with ph03nix how this can be reproduced nowadays, what to do, where to conduct the "package rollback experiment", etc.

#5 Updated by mkittler 27 days ago

  • Assignee set to mkittler

#6 Updated by tbaev 25 days ago

I have a way to reproduce the test. The run below is from yesterday on an up-to-date system; I can restart the test and it will fail again. For access, reach out to me on Slack.

http://d453.qam.suse.de/tests/624

openqa-clone-job \
--from openqa.qam.suse.cz \
-v 30582 \
--host localhost \
--clone-children \
--parental-inheritance \
WORKER_CLASS=horror \
BACKEND=ipmi \
INCIDENT_ID= \
INCIDENT_REPO= \
CASEDIR=https://github.com/tbaev/os-autoinst-distri-opensuse.git#parallel_guest_install

#7 Updated by okurz 21 days ago

Old packages are kept in the path /var/cache/zypp/packages/
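
A sketch of how those cached rpms could be located for a downgrade; the cutoff date comes from comment #1, it assumes the repos had keep-packages enabled so old rpms are still cached, and the package path in the zypper call is a placeholder:

```shell
# List cached rpms last modified before the suspected regression date;
# these are downgrade candidates for "zypper install --oldpackage".
# Assumes keep-packages was enabled so old rpms remain in the cache.
CACHE=${CACHE:-/var/cache/zypp/packages}
if [ -d "$CACHE" ]; then
    find "$CACHE" -name '*.rpm' ! -newermt 2021-09-21 -print
fi
# then, per package (path is a placeholder):
# sudo zypper install --oldpackage "$CACHE/<repo>/x86_64/<name-version>.rpm"
```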

#8 Updated by okurz 20 days ago

  • Status changed from Workable to Rejected
  • Assignee changed from mkittler to okurz

By now it is unfortunately no longer feasible to simply try a rollback.

I used openqa-query-for-job-label poo#99030 and found no matches

I created https://github.com/os-autoinst/scripts/pull/119, and with that I could run a longer query going back 90 days: interval='90 day' openqa-query-for-job-label poo#99030:

7221423|2021-09-25 00:49:56|done|incomplete|sle-micro_containers|backend died: Lost SSH connection to SUT: Failure while draining incoming flow at /usr/lib/os-autoinst/consoles/ssh_screen.pm line 89.|grenache-1
7220091|2021-09-24 21:11:37|done|incomplete|ltp_cve_git|backend died: Lost SSH connection to SUT: Failure while draining incoming flow at /usr/lib/os-autoinst/consoles/ssh_screen.pm line 89.|grenache-1
7224329|2021-09-24 16:16:11|done|failed|qam-sles4sap_online_dvd_gnome_hana_nvdimm||grenache-1
7207682|2021-09-23 21:51:28|done|incomplete|engines_and_tools_podman|backend died: Lost SSH connection to SUT: Failure while draining incoming flow at /usr/lib/os-autoinst/consoles/ssh_screen.pm line 89.|grenache-1
7172003|2021-09-20 09:17:44|done|incomplete|ltp_syscalls_debug_pagealloc|backend died: Lost SSH connection to SUT: Failure while draining incoming flow at /usr/lib/os-autoinst/consoles/ssh_screen.pm line 89.|grenache-1
7171999|2021-09-20 09:13:19|done|incomplete|kernel-live-patching|backend died: Lost SSH connection to SUT: Failure while draining incoming flow at /usr/lib/os-autoinst/consoles/ssh_screen.pm line 89.|grenache-1

As the current ticket applies to changes on the OSD infrastructure only and we no longer see the problem there, we should reject this ticket and continue with other tasks in the parent epic.
