coordination #99030

closed

coordination #109668: [saga][epic] Stable and updated non-qemu backends for SLE validation

[epic] openQA bare-metal test dies due to lost SSH connection auto_review:"backend died: Lost SSH connection to SUT: Failure while draining incoming flow":retry

Added by ph03nix over 2 years ago. Updated about 2 years ago.

Status: Resolved
Priority: Normal
Assignee:
Category: Regressions/Crashes
Target version:
Start date: 2021-09-23
Due date:
% Done: 100%
Estimated time: (Total: 0.00 h)

Description

Observation

In recent test runs we are seeing the following error multiple times in a row:

backend died: Lost SSH connection to SUT: Failure while draining incoming flow at /usr/lib/os-autoinst/consoles/ssh_screen.pm line 89.

See http://duck-norris.qam.suse.de/tests/7274 (and the cloned jobs within). Other jobs are also affected, e.g. https://openqa.suse.de/tests/7171999 and http://openqa.qam.suse.cz/tests/28072

No workaround is possible and this is a major blocker, as some of the bare-metal test runs cannot complete.

Steps to reproduce

Run a bare-metal test run on conan.qam.suse.de (high chance of failure) or on openqa.qam.suse.cz (lower chance of failure).

Find jobs referencing this ticket with the help of
https://raw.githubusercontent.com/os-autoinst/scripts/master/openqa-query-for-job-label
by calling openqa-query-for-job-label poo#99030

Schedule a test run on openqa.qam.suse.cz (replace host with your own instance)

openqa-cli api --host http://openqa.qam.suse.cz -X POST isos ARCH="x86_64" DISTRI="sle" VERSION="15-SP3" FLAVOR="Server-DVD-Virt-Incidents" BUILD=":12345:qemu" INCIDENT_REPO=""

Restarting job http://duck-norris.qam.suse.de/tests/7269 might also act as a reproducer (ping @ph03nix for access to that machine)

Acceptance criteria

  • AC1: Bare metal tests don't die

Problem

Hypothesis: The SSH connection to the SUT drops.

Suggestion

  • Provide a way for the backend to reconnect a dropped SSH connection
  • Use auto-review with automatic retriggering, which avoids repeating the same manual intervention and gives us more data about affected jobs/machines/architectures
  • Roll back package updates, i.e. use older rpms that we have stored in local caches or btrfs snapshots of the root fs.
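
The first two suggestions share one idea: retry a failed step a bounded number of times instead of dying on the first failure. The snippet below is only a minimal shell sketch of that idea, not the os-autoinst backend code (which is Perl). The function name retry_with_delay and the host sut.example.com are placeholders made up for illustration.

```shell
#!/bin/bash
# Hypothetical helper (not part of os-autoinst): run a command and retry it
# up to $1 times, sleeping $2 seconds between attempts.
# Returns 0 on the first success, 1 if every attempt fails.
retry_with_delay() {
    local max_attempts=$1 delay=$2
    shift 2
    local attempt=1
    while true; do
        if "$@"; then
            return 0
        fi
        if [ "$attempt" -ge "$max_attempts" ]; then
            echo "giving up after $attempt attempts: $*" >&2
            return 1
        fi
        echo "attempt $attempt failed, retrying in ${delay}s" >&2
        attempt=$((attempt + 1))
        sleep "$delay"
    done
}

# Example (placeholder host): probe the SUT over SSH up to 5 times, 10 s apart.
# retry_with_delay 5 10 ssh -o ConnectTimeout=10 -o BatchMode=yes root@sut.example.com true
```

A real fix in the backend would additionally have to re-establish the console state after reconnecting; the sketch only covers the retry loop itself.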

Workaround

  • None possible. This affects our virtualization test runs and (probably) other bare-metal test runs as well.

Subtasks: 2 (0 open, 2 closed)

action #99108: Automatically retry openQA bare-metal tests size:S (Resolved, okurz, 2021-09-23)
action #99111: Confirm or disprove that openQA bare-metal test loses SSH connection due to package updates size:M (Rejected, okurz, 2021-09-23)
