coordination #99030
Updated by livdywan over 3 years ago
## Observation

In recent test runs we are seeing the following error multiple times in a row:

~~~
backend died: Lost SSH connection to SUT: Failure while draining incoming flow at /usr/lib/os-autoinst/consoles/ssh_screen.pm line 89.
~~~

See http://duck-norris.qam.suse.de/tests/7274 (and the cloned jobs within). Other jobs are affected as well, e.g. https://openqa.suse.de/tests/7171999 and http://openqa.qam.suse.cz/tests/28072.

No workaround is possible and this is a major blocker, because some of the bare-metal test runs cannot complete.

## Steps to reproduce

Run a bare-metal test run on conan.qam.suse.de (high chance of failure) or on openqa.qam.suse.cz (lower chance of failure).

## Acceptance criteria

- **AC1**: Bare metal tests don't die

## Problem

Hypothesis: the SSH connection drops.

## Suggestion

* Fix the network
* Provide a way for the backend to reconnect a dropped SSH connection
* Use [auto-review](https://github.com/os-autoinst/scripts/blob/master/README.md#auto-review---automatically-detect-known-issues-in-openqa-jobs-label-openqa-jobs-with-ticket-references-and-optionally-retrigger) with automatic retriggering, which avoids repeating the same manual action and gives us more data about affected jobs/machines/architectures
* Roll back package updates, i.e. use older RPMs that we have stored in local caches or btrfs snapshots of the root fs

## Workaround

* None possible. This has a major impact on our virtualization test runs and (probably) other bare-metal test runs as well.
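
To make the "reconnect a dropped SSH connection" suggestion more concrete, here is a minimal, language-neutral sketch of the idea in Python. It is not the os-autoinst implementation (the backend is Perl). `ConnectionLost`, `run_with_reconnect`, `connect` and `operation` are hypothetical names introduced only for illustration. The assumption is that a dropped connection surfaces as a distinct error, so that the caller can open a fresh connection and retry instead of letting the backend die.

```python
import time


class ConnectionLost(Exception):
    """Hypothetical error raised when the SSH channel to the SUT drops."""


def run_with_reconnect(connect, operation, max_attempts=3, base_delay=1.0,
                       sleep=time.sleep):
    """Run operation(conn); on ConnectionLost, reconnect and retry.

    connect   -- callable returning a fresh connection object (assumed)
    operation -- callable taking the connection and doing the actual work
    The delay doubles after every failed attempt (exponential backoff).
    """
    conn = connect()
    for attempt in range(1, max_attempts + 1):
        try:
            return operation(conn)
        except ConnectionLost:
            if attempt == max_attempts:
                raise  # out of attempts: let the caller fail the job
            sleep(base_delay * 2 ** (attempt - 1))
            conn = connect()  # open a fresh connection before retrying
```

Whether retrying is safe depends on the operation: draining console output can probably be resumed, while a half-executed command on the SUT may not be repeatable. The limit on attempts keeps a permanently broken network from hanging the job.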