coordination #99030
Updated by ph03nix over 2 years ago
## Observation

In recent test runs, the following error occurs multiple times in a row:

~~~
backend died: Lost SSH connection to SUT: Failure while draining incoming flow at /usr/lib/os-autoinst/consoles/ssh_screen.pm line 89.
~~~

See http://duck-norris.qam.suse.de/tests/7274 (and the cloned jobs within). Other jobs are affected as well, e.g. https://openqa.suse.de/tests/7171999 and http://openqa.qam.suse.cz/tests/28072.

No workaround is possible and this is a major blocker, as some of the bare-metal test runs are not able to complete.

## Steps to reproduce

* Run a bare-metal test run on conan.qam.suse.de (high chance of failure) or on openqa.qam.suse.cz (lower chance of failure).
* Find jobs referencing this ticket with the help of https://raw.githubusercontent.com/os-autoinst/scripts/master/openqa-query-for-job-label , call `openqa-query-for-job-label poo#99030`
* Schedule a test run on openqa.qam.suse.cz (replace the host with your own instance):

~~~
openqa-cli api --host http://openqa.qam.suse.cz -X POST isos ARCH="x86_64" DISTRI="sle" VERSION="15-SP3" FLAVOR="Server-DVD-Virt-Incidents" BUILD=":12345:qemu" INCIDENT_REPO=""
~~~

* Restarting job http://duck-norris.qam.suse.de/tests/7269 might also act as a reproducer (ping @ph03nix for access to that machine).

## Acceptance criteria

- **AC1**: Bare-metal tests don't die

## Problem

Hypothesis: the SSH connection drops.

## Suggestion

* Provide a way for the backend to reconnect a dropped SSH connection
* Use [auto-review](https://github.com/os-autoinst/scripts/blob/master/README.md#auto-review---automatically-detect-known-issues-in-openqa-jobs-label-openqa-jobs-with-ticket-references-and-optionally-retrigger) with automatic retriggering, which avoids repeating the same manual intervention and gives us more data about affected jobs/machines/architectures
* Roll back package updates, i.e. use older rpms that we have stored in local caches or btrfs snapshots of the root fs.
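
To illustrate the first suggestion (reconnecting a dropped connection instead of letting the backend die), here is a minimal Python sketch. It is not the os-autoinst implementation, which is written in Perl; `ReconnectingSession`, `FlakyConnection`, and all parameter names are hypothetical and exist only for this example. The sketch assumes that a dropped connection surfaces as a `ConnectionError` and that opening a fresh connection is safe to repeat.

```python
import time
from typing import Callable, Optional, TypeVar

T = TypeVar("T")


class ReconnectingSession:
    """Hypothetical wrapper: reopen the connection when an operation fails."""

    def __init__(self, connect: Callable[[], object],
                 max_attempts: int = 3, delay: float = 0.0) -> None:
        if max_attempts < 1:
            raise ValueError("max_attempts must be at least 1")
        self._connect = connect          # factory that opens a new connection
        self._max_attempts = max_attempts
        self._delay = delay              # pause between attempts, in seconds
        self._conn: Optional[object] = None
        self.reconnects = 0              # number of retries scheduled so far

    def run(self, operation: Callable[[object], T]) -> T:
        """Run operation(conn); on ConnectionError drop conn and retry."""
        last_error: Optional[BaseException] = None
        for attempt in range(self._max_attempts):
            try:
                if self._conn is None:
                    self._conn = self._connect()
                return operation(self._conn)
            except ConnectionError as err:
                last_error = err
                self._conn = None        # force a fresh connection next time
                if attempt + 1 < self._max_attempts:
                    self.reconnects += 1
                    time.sleep(self._delay)
        raise ConnectionError(
            f"gave up after {self._max_attempts} attempts") from last_error


class FlakyConnection:
    """Stand-in for an SSH channel that drops after a fixed number of reads."""

    def __init__(self, reads_before_drop: int) -> None:
        self._remaining = reads_before_drop

    def read(self) -> str:
        if self._remaining <= 0:
            raise ConnectionError("Lost SSH connection to SUT")
        self._remaining -= 1
        return "data"


# Usage: every connection drops after two reads, yet five reads succeed
# because the session reconnects transparently.
session = ReconnectingSession(lambda: FlakyConnection(reads_before_drop=2))
results = [session.run(lambda conn: conn.read()) for _ in range(5)]
print(results, session.reconnects)
```

The attempt limit is deliberate: if the SUT is really gone, the job should still fail with a clear error rather than retry forever. Whether a retried operation is safe depends on what it does; reading pending output can usually be repeated, while commands with side effects may not be.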
## Workaround

* None possible; this has a major impact on our virtualization test runs and (probably) other bare-metal test runs as well.