coordination #99030 (closed)

Parent: coordination #109668: [saga][epic] Stable and updated non-qemu backends for SLE validation

[epic] openQA bare-metal test dies due to lost SSH connection auto_review:"backend died: Lost SSH connection to SUT: Failure while draining incoming flow":retry

Added by ph03nix over 2 years ago. Updated almost 2 years ago.

Status: Resolved
Priority: Normal
Assignee:
Category: Regressions/Crashes
Target version:
Start date: 2021-09-23
Due date:
% Done: 100%
Estimated time: (Total: 0.00 h)

Description

Observation

In recent test runs we are seeing the following error multiple times in a row:

backend died: Lost SSH connection to SUT: Failure while draining incoming flow at /usr/lib/os-autoinst/consoles/ssh_screen.pm line 89.

See http://duck-norris.qam.suse.de/tests/7274 (and the cloned jobs within). Other jobs are affected as well, e.g. https://openqa.suse.de/tests/7171999 and http://openqa.qam.suse.cz/tests/28072.

No workaround is possible and this is a major blocker, as some of the bare-metal test runs are not able to complete.

Steps to reproduce

Run a bare-metal test run on conan.qam.suse.de (high chance of failure) or on openqa.qam.suse.cz (lower chance of failure).

Find jobs referencing this ticket with the help of https://raw.githubusercontent.com/os-autoinst/scripts/master/openqa-query-for-job-label, i.e. call openqa-query-for-job-label poo#99030

Schedule a test run on openqa.qam.suse.cz (replace host with your own instance)

openqa-cli api --host http://openqa.qam.suse.cz -X POST isos ARCH="x86_64" DISTRI="sle" VERSION="15-SP3" FLAVOR="Server-DVD-Virt-Incidents" BUILD=":12345:qemu" INCIDENT_REPO=""

Restarting job http://duck-norris.qam.suse.de/tests/7269 might also act as a reproducer (ping @ph03nix for access to that machine).

Acceptance criteria

  • AC1: Bare metal tests don't die

Problem

Hypothesis: ssh connection drops.

Suggestion

  • Provide a way for the backend to reconnect a dropped ssh connection
  • Use auto-review with automatic retriggering, which avoids repeated manual intervention and gives us more data about affected jobs/machines/architectures
  • Roll back package updates, i.e. to older rpms that we have stored in local caches or to btrfs snapshots of the root fs.

Workaround

  • None possible; this has an impact on our virtualization test runs and (probably) other bare-metal test runs as well.

Subtasks 2 (0 open, 2 closed)

action #99108: Automatically retry openQA bare-metal tests size:S (Resolved, okurz, 2021-09-23)

action #99111: Confirm or disprove that openQA bare-metal test loses SSH connection due to package updates size:M (Rejected, okurz, 2021-09-23)

Actions #1

Updated by okurz over 2 years ago

  • Target version set to future

I agree that this is annoying and should, if possible, be fixed in the near future. Right now I don't see it as feasible for SUSE QE Tools to look into, as the related backend(s) are mostly maintained by testers with access to the corresponding bare-metal test setup. So I suggest that members of SUSE QE Core or SUSE QE Kernel look into this.

Actions #2

Updated by ph03nix over 2 years ago

  • Subject changed from openQA bare-metal test dies due to SSH connection last to openQA bare-metal test dies due to lost SSH connection
Actions #3

Updated by ph03nix over 2 years ago

okurz wrote:

I agree that this is annoying and should, if possible, be fixed in the near future. Right now I don't see it as feasible for SUSE QE Tools to look into, as the related backend(s) are mostly maintained by testers with access to the corresponding bare-metal test setup. So I suggest that members of SUSE QE Core or SUSE QE Kernel look into this.

Just to get this right: the IPMI backend of os-autoinst is not handled by the openQA tools team? The whole bare-metal workflow relies heavily on this and it would be crazy to rely here on the goodwill of some testers.

Actions #4

Updated by MDoucha over 2 years ago

Note: This is the exact same problem that I reported on Monday for s390x tests on grenache (svirt backend):
https://openqa.suse.de/tests/7171999
https://openqa.suse.de/tests/7172003

Actions #5

Updated by okurz over 2 years ago

ph03nix wrote:

Just to get this right: the IPMI backend of os-autoinst is not handled by the openQA tools team? The whole bare-metal workflow relies heavily on this and it would be crazy to rely here on the goodwill of some testers.

More details on
https://progress.opensuse.org/projects/qa/wiki/Wiki#Out-of-scope
"We maintain the code for all backends but we are no experts in specific domains. So we always try to help but it's a case by case decision based on what we realistically can provide based on our competence."

Actions #6

Updated by mkittler over 2 years ago

The error sounds like the connection is interrupted while reading incoming data (before writing data to type a string). The error message comes from libssh2 itself: https://github.com/karelia/libssh2/blob/20eb836f4e763b8a55602b753c98e10647ad58de/src/channel.c#L2019

So I'd still say that general networking issues could cause this. The code looks correct and simply handles the error raised by the underlying libssh2 library. However, the croak in this code is a bit problematic. It would be more reliable to retry and, if necessary, even re-establish the SSH connection. By the way, the function was introduced last year in 6688124912400254ae6071741d4d2097d09ebc68 by @MDoucha, so this is a relatively new feature. Possibly we haven't seen the problem very often before because it simply wasn't used much.
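
A minimal sketch of what such a retry-and-reconnect wrapper could look like, assuming the console object keeps its Net::SSH2 session and channel in $self->{ssh} and $self->{channel} and offers a hypothetical reconnect() helper; this is not the actual os-autoinst code:

use strict;
use warnings;
use Net::SSH2 qw(LIBSSH2_ERROR_EAGAIN);

sub drain_with_retry {
    my ($self, $max_attempts) = @_;
    $max_attempts //= 3;

    for my $attempt (1 .. $max_attempts) {
        my $buffer;
        my $read = $self->{channel}->read($buffer, 4096);
        return $buffer if defined $read;    # data drained successfully

        my ($code, $name, $message) = $self->{ssh}->error;
        next if $code == LIBSSH2_ERROR_EAGAIN;    # transient, simply try again

        # Anything else: assume the connection is gone and try to re-establish
        # it. reconnect() is a hypothetical helper that would redo connect,
        # authentication and channel setup.
        warn "SSH read failed ($name/$code): $message, reconnecting (attempt $attempt)\n";
        $self->reconnect;
    }
    die "Lost SSH connection to SUT after $max_attempts attempts\n";
}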

Actions #7

Updated by MDoucha over 2 years ago

mkittler wrote:

So I'd still say that general networking issues could cause this. The code looks correct and simply handles the error raised by the underlying libssh2 library. However, the croak in this code is a bit problematic. It would be more reliable to retry and, if necessary, even re-establish the SSH connection. By the way, the function was introduced last year in 6688124912400254ae6071741d4d2097d09ebc68 by @MDoucha, so this is a relatively new feature. Possibly we haven't seen the problem very often before because it simply wasn't used much.

That particular part of code was used by every single LTP test on s390x, IPMI and SPVM in the past year. If we're seeing these failures only since last week, it's because something broke last week.

Actions #8

Updated by MDoucha over 2 years ago

Also note that retrying ssh_channel->write() on any error other than EAGAIN is problematic because part of the message may have been delivered to the remote system and the code has no way to tell how long that part was.
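
To illustrate the point, here is a hedged sketch of a write loop with libssh2-style semantics; the $write_chunk callback standing in for the low-level write is an assumption. Only EAGAIN leaves the byte offset in a known state, so only EAGAIN can be retried safely:

use strict;
use warnings;
use constant LIBSSH2_ERROR_EAGAIN => -37;    # libssh2's "try again" code

sub send_all {
    my ($write_chunk, $data) = @_;    # $write_chunk returns bytes written or a negative error code
    my $offset = 0;

    while ($offset < length $data) {
        my $rc = $write_chunk->(substr($data, $offset));
        if ($rc >= 0) {
            $offset += $rc;    # partial write, but the consumed length is known
        }
        elsif ($rc == LIBSSH2_ERROR_EAGAIN) {
            next;    # nothing was committed, retrying the same chunk is safe
        }
        else {
            # Any other error: an unknown prefix of the chunk may already have
            # reached the SUT, so blindly resending could duplicate typed input.
            die "SSH write failed with error $rc after $offset confirmed bytes\n";
        }
    }
    return $offset;
}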

Actions #9

Updated by mkittler over 2 years ago

is problematic because part of the message may have been delivered to the remote system and the code has no way to tell how long that part was.

That's unfortunately true. So a retry would be a retry on an unknown state, which makes it more difficult.

If we're seeing these failures only since last week, it's because something broke last week.

Yes. It is unlikely that the code itself broke, because we haven't made any changes since your initial commit. Maybe something changed on the other side of the SSH connection?

Actions #10

Updated by ph03nix over 2 years ago

I have been seeing those issues for months already, but they are very sporadic. Only recently have they become worse, which is why I filed this issue.
The assumption that this only started happening recently is not true.

We also suspect network issues as the actual culprit. My best guess is that the TCP connection gets dropped by a router/firewall because of low activity. Perhaps keepalive packets could help, though that is just a wild guess.

Actions #12

Updated by okurz over 2 years ago

Please see the related issue #97334, which oorlov made me aware of today.

Actions #13

Updated by okurz over 2 years ago

  • Target version changed from future to Ready

Lengthy discussion in https://suse.slack.com/archives/C02CANHLANP/p1632295742434400 but effectively I "gave up" and accepted that we should work on this ticket within SUSE QE Tools :D So let's see what we can do. I was told in the mentioned chat multiple times that "manual intervention" is often required. I don't know what that means. What I suggest to address the urgency is to use https://github.com/os-autoinst/scripts/blob/master/README.md#auto-review---automatically-detect-known-issues-in-openqa-jobs-label-openqa-jobs-with-ticket-references-and-optionally-retrigger with automatic retriggering for a start.

EDIT: Additional suggestion by mdoucha: roll back package updates, i.e. to older rpms that we have stored in local caches or to btrfs snapshots of the root fs.

Actions #14

Updated by livdywan over 2 years ago

  • Tracker changed from action to coordination
  • Subject changed from openQA bare-metal test dies due to lost SSH connection to [epic] openQA bare-metal test dies due to lost SSH connection
  • Description updated (diff)
  • Status changed from New to Workable
Actions #15

Updated by okurz over 2 years ago

  • Subject changed from [epic] openQA bare-metal test dies due to lost SSH connection to [epic] openQA bare-metal test dies due to lost SSH connection auto_review:"backend died: Lost SSH connection to SUT: Failure while draining incoming flow":retry
  • Description updated (diff)

So to clarify: as part of the initial triaging I failed to mention that there were no recent changes to the backend code, and I am not aware of any related changes on any worker hosts that could explain this. For OSD, which we within SUSE QE Tools maintain, we track which changes of os-autoinst+openQA we deploy as well as the automatic package updates. For any other host, system packages can have an impact, and it would be good to know whether any updates were applied between the "last good" and the "first bad", which could be used as a starting point for bisecting.

@ph03nix I guess it could help if you could extend the https://progress.opensuse.org/issues/99030#Steps-to-reproduce with something like a single command line, e.g. using openqa-clone-job --within-instance https://openqa.suse.de …, which would reproduce the problem

Actions #16

Updated by ph03nix over 2 years ago

okurz wrote:

@ph03nix I guess it could help if you could extend the https://progress.opensuse.org/issues/99030#Steps-to-reproduce with something like a single command line, e.g. using openqa-clone-job --within-instance https://openqa.suse.de …, which would reproduce the problem

At the moment the problem lies mostly on my own openQA instance: see http://duck-norris.qam.suse.de/tests/7274 (see the job and its previous ones, 4 failures in a row). On openqa.suse.cz the error is not as likely to occur. This is important as duck-norris is located in NUE while the used IPMI worker is in Prague, which might suggest an underlying network issue.

I'll update the reproducer with a scheduling directive to schedule a virtualization test run. However, access to an IPMI worker is necessary; see https://confluence.suse.com/display/maintenanceqa/OpenQA-IMPI+setup for details.

Actions #17

Updated by ph03nix over 2 years ago

  • Description updated (diff)
Actions #18

Updated by okurz over 2 years ago

  • Status changed from Workable to Feedback
  • Assignee set to okurz

ph03nix wrote:

okurz wrote:

@ph03nix I guess it could help if you could extend the https://progress.opensuse.org/issues/99030#Steps-to-reproduce with something like a single command line, e.g. using openqa-clone-job --within-instance https://openqa.suse.de …, which would reproduce the problem

At the moment the problem lies mostly on my own openQA instance: see http://duck-norris.qam.suse.de/tests/7274 (see the job and its previous ones, 4 failures in a row). On openqa.suse.cz the error is not as likely to occur. This is important as duck-norris is located in NUE while the used IPMI worker is in Prague, which might suggest an underlying network issue.

Alright. Can you please ensure that both duck-norris.qam.suse.de as well as openqa.suse.cz are consistently updated to the latest set of system packages as well as os-autoinst+openQA on openSUSE Leap 15.2?

Actions #19

Updated by nicksinger over 2 years ago

@mgriessmeier opened a Jira ticket for Infra and they ask us to provide the following information:

Can you please provide more specific information:

    Which hosts are affected? (from the QA subnet)
    to track host connectivity from bare-metal > hosts > switch connection ..
    Can you paste any related traceroute?
    the long thread is also pointing to a mainframe issue? is it still valid?
    I cannot quite see the job failures in the thread as it is code related. It would help to debug on the hosts themselves.

I tried to understand which hosts are involved here but failed. Could you please list them? As far as I can tell, these machines are mostly in the .cz domain and not inside the NBG network?

Actions #20

Updated by nicksinger over 2 years ago

After talking with @okurz yesterday I concluded that one of the host pairs where the problem appears is malbec-1 <-> s390zp[18,19].suse.de. I ran mtr overnight between them and currently observe "only" 0.1-0.3% packet loss, which seems just OK. I plan to run some long-lasting SSH sessions too to see if we might be able to reproduce the issue in a simpler environment.

However, currently I'd also recommend implementing alternatives to ssh like mosh or autossh, which are better suited for long-lasting ssh sessions. My personal experience is that ssh connections drop from time to time and a perfectly stable connection can't be ensured. It might also be worthwhile to investigate whether ssh can be tuned via parameters for longer timeouts or other stability improvements.

Actions #21

Updated by ph03nix over 2 years ago

Keep-alive post: I still see this issue on my openQA instance: http://duck-norris.qam.suse.de/tests/7276

Actions #22

Updated by okurz over 2 years ago

ph03nix wrote:

Keep-alive post: I still see this issue on my openQA instance: http://duck-norris.qam.suse.de/tests/7276

Hi, could you help us answer the questions in https://progress.opensuse.org/issues/99030#note-18:

Can you please ensure that both duck-norris.qam.suse.de as well as openqa.suse.cz are consistently updated to the latest set of system packages as well as os-autoinst+openQA on openSUSE Leap 15.2?

Also, nsinger and I suggested replacing ssh calls with mosh or autossh, see https://progress.opensuse.org/issues/99030#note-20
We are currently using https://metacpan.org/pod/Net::SSH2 which relies on libssh2. Maybe https://metacpan.org/pod/Net::SSH2#keepalive_config(want_reply,-interval) would help. Otherwise consider https://metacpan.org/pod/Net::OpenSSH and configuring https://en.wikibooks.org/wiki/OpenSSH/Cookbook/Multiplexing#Setting_Up_Multiplexing . As another alternative maybe https://metacpan.org/pod/Net::SSH::Perl could be used and the ssh binary could be configured for multiplexing or replaced with autossh or mosh.
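
For the keepalive idea, here is a minimal sketch of enabling libssh2-level keepalives through Net::SSH2; the host, credentials and the 30-second interval are placeholder assumptions, not a tested configuration:

use strict;
use warnings;
use Net::SSH2;

my $ssh2 = Net::SSH2->new;
$ssh2->connect('sut.example.suse.de')       or die 'connect failed';          # placeholder host
$ssh2->auth_password('root', 'placeholder') or die 'authentication failed';   # placeholder credentials

# Ask the server to acknowledge keepalives and send one at most every 30 seconds.
$ssh2->keepalive_config(1, 30);

# keepalive_send() still has to be called periodically (e.g. from the screen's
# polling loop); it returns the number of seconds until the next one is due.
my $seconds_to_next = $ssh2->keepalive_send;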

Regarding the likelihood of appearance, right now we have the following from OSD via openqa-query-for-job-label poo#99030:

7261756|2021-09-29 08:45:22|done|incomplete|blktests_spvm|backend died: Lost SSH connection to SUT: Failure while draining incoming flow at /usr/lib/os-autoinst/consoles/ssh_screen.pm line 89, <$fh> line 10.|grenache-1
7241276|2021-09-27 10:34:58|done|incomplete|install_ltp+sle+Server-DVD-Updates:investigate:retry|backend died: Lost SSH connection to SUT: Failure while draining incoming flow at /usr/lib/os-autoinst/consoles/ssh_screen.pm line 89.|grenache-1
7229941|2021-09-25 08:02:03|done|incomplete|bci_docker|backend died: Lost SSH connection to SUT: Failure while draining incoming flow at /usr/lib/os-autoinst/consoles/ssh_screen.pm line 89.|grenache-1
7221423|2021-09-25 00:49:56|done|incomplete|sle-micro_containers|backend died: Lost SSH connection to SUT: Failure while draining incoming flow at /usr/lib/os-autoinst/consoles/ssh_screen.pm line 89.|grenache-1
7220091|2021-09-24 21:11:37|done|incomplete|ltp_cve_git|backend died: Lost SSH connection to SUT: Failure while draining incoming flow at /usr/lib/os-autoinst/consoles/ssh_screen.pm line 89.|grenache-1
7225607|2021-09-24 19:28:00|done|incomplete|bci_podman|backend died: Lost SSH connection to SUT: Failure while draining incoming flow at /usr/lib/os-autoinst/consoles/ssh_screen.pm line 89.|grenache-1
7224329|2021-09-24 16:16:11|done|failed|qam-sles4sap_online_dvd_gnome_hana_nvdimm||grenache-1
7222549|2021-09-24 13:07:33|done|incomplete|bci_podman|backend died: Lost SSH connection to SUT: Failure while draining incoming flow at /usr/lib/os-autoinst/consoles/ssh_screen.pm line 89.|grenache-1
7207682|2021-09-23 21:51:28|done|incomplete|engines_and_tools_podman|backend died: Lost SSH connection to SUT: Failure while draining incoming flow at /usr/lib/os-autoinst/consoles/ssh_screen.pm line 89.|grenache-1
7208187|2021-09-23 21:35:13|done|failed|qam-incidentinstall||grenache-1

so roughly two jobs per day

@MDoucha the relevant code is https://github.com/os-autoinst/os-autoinst/blob/master/consoles/ssh_screen.pm#L89 from your commit https://github.com/os-autoinst/os-autoinst/commit/6688124912400254ae6071741d4d2097d09ebc68#diff-83204dbd08fa5fb7b23a334f93dfd4cd360670f7b53d480dc352272f4725eb40 . Maybe you have some hints to ph03nix how to debug as he seems to have better reproducibility?

Actions #23

Updated by MDoucha over 2 years ago

nicksinger wrote:

After talking with @okurz yesterday I concluded that one of the host pairs where the problem appears is malbec-1 <-> s390zp[18,19].suse.de. I ran mtr overnight between them and currently observe "only" 0.1-0.3% packet loss, which seems just OK. I plan to run some long-lasting SSH sessions too to see if we might be able to reproduce the issue in a simpler environment.

The issue on OSD most likely got fixed by https://github.com/os-autoinst/os-autoinst/pull/1783. There will be a new batch of livepatches in a few days so if the problem still exists, it'll show up there.

However, currently I'd also recommend implementing alternatives to ssh like mosh or autossh, which are better suited for long-lasting ssh sessions. My personal experience is that ssh connections drop from time to time and a perfectly stable connection can't be ensured. It might also be worthwhile to investigate whether ssh can be tuned via parameters for longer timeouts or other stability improvements.

The whole console subsystem in os-autoinst needs a complete rewrite so that the SSH implementations can use exec() instead of writing commands into a remote interactive shell and parsing magic strings. That'd allow much better error handling and even some ability to recover from connection failures.
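
As an illustration of the exec() approach (a hedged Net::SSH2 sketch, not a design for the actual rewrite): each command gets its own channel, so output and exit status come from the SSH protocol itself instead of from parsing magic marker strings.

use strict;
use warnings;
use Net::SSH2;

sub run_cmd {
    my ($ssh2, $cmd) = @_;    # $ssh2: an already connected and authenticated session
    my $chan = $ssh2->channel or die 'cannot open channel';
    $chan->exec($cmd)         or die "exec failed: $cmd";

    my ($output, $buf) = ('', '');
    $output .= $buf while $chan->read($buf, 4096);
    $chan->close;

    # The exit status is reported by the protocol once the channel is closed.
    return ($chan->exit_status, $output);
}

# Example: my ($status, $out) = run_cmd($ssh2, 'uname -r');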

okurz wrote:

@MDoucha the relevant code is https://github.com/os-autoinst/os-autoinst/blob/master/consoles/ssh_screen.pm#L89 from your commit https://github.com/os-autoinst/os-autoinst/commit/6688124912400254ae6071741d4d2097d09ebc68#diff-83204dbd08fa5fb7b23a334f93dfd4cd360670f7b53d480dc352272f4725eb40 . Maybe you have some hints to ph03nix how to debug as he seems to have better reproducibility?

I'd start by modifying the error message at https://github.com/os-autoinst/os-autoinst/blob/f34a40a3258d1fee5e8e413cbbebf11c1b92bb77/consoles/ssh_screen.pm#L77 to include $errcode because libssh2 returns the same message here regardless of the actual error. @ph03nix can do that locally before we merge and deploy the PR.
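
A hedged sketch of what such a change could look like, assuming the Net::SSH2 session is available as $ssh; the exact variable names in ssh_screen.pm may differ:

use strict;
use warnings;
use Carp 'croak';

sub croak_with_ssh_error {
    my ($ssh) = @_;    # a Net::SSH2 session object
    # error() in list context returns the numeric code, the symbolic error name
    # and the human-readable message, so all three can go into the reason.
    my ($errcode, $errname, $errstr) = $ssh->error;
    croak "Lost SSH connection to SUT: $errstr (error $errcode/$errname)";
}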

Actions #24

Updated by ph03nix over 2 years ago

okurz wrote:

Alright. Can you please ensure that both duck-norris.qam.suse.de as well as openqa.suse.cz are consistently updated to the latest set of system packages as well as os-autoinst+openQA on openSUSE Leap 15.2?

Yes they are. openqa.qam.suse.cz got updated last week and I update duck-norris.qam.suse.de every week.

MDoucha wrote:

okurz wrote:

@MDoucha the relevant code is https://github.com/os-autoinst/os-autoinst/blob/master/consoles/ssh_screen.pm#L89 from your commit https://github.com/os-autoinst/os-autoinst/commit/6688124912400254ae6071741d4d2097d09ebc68#diff-83204dbd08fa5fb7b23a334f93dfd4cd360670f7b53d480dc352272f4725eb40 . Maybe you have some hints to ph03nix how to debug as he seems to have better reproducibility?

I'd start by modifying the error message at https://github.com/os-autoinst/os-autoinst/blob/f34a40a3258d1fee5e8e413cbbebf11c1b92bb77/consoles/ssh_screen.pm#L77 to include $errcode because libssh2 returns the same message here regardless of the actual error. @ph03nix can do that locally before we merge and deploy the PR.

Sure, I'll include this commit and re-trigger the failing test runs later today or tomorrow. Will report back with the results when ready.

Actions #25

Updated by MDoucha over 2 years ago

ph03nix wrote:

MDoucha wrote:

I'd start by modifying the error message at https://github.com/os-autoinst/os-autoinst/blob/f34a40a3258d1fee5e8e413cbbebf11c1b92bb77/consoles/ssh_screen.pm#L77 to include $errcode because libssh2 returns the same message here regardless of the actual error. @ph03nix can do that locally before we merge and deploy the PR.

Sure, I'll include this commit and re-trigger the failing test runs later today or tomorrow. Will report back with the results when ready.

There is no commit to include yet. That's just a permalink to ensure that everyone who clicks the link will see the same highlighted line. You'll need to edit ssh_screen.pm yourself for now.

Actions #26

Updated by MDoucha over 2 years ago

PR with error code in log message:
https://github.com/os-autoinst/os-autoinst/pull/1805

Actions #27

Updated by tbaev over 2 years ago

I have run the same tests from void and from conan. The test will always fail from conan, but it passed in void.

void run:
http://openqa.qam.suse.cz/tests/28700

conan run:
http://d488.qam.suse.de/tests/952

Actions #28

Updated by xlai over 2 years ago

  • Status changed from Feedback to Workable

@okurz,
According to the feedback from the qe-virtualization squad, this issue still exists, and it HEAVILY IMPACTS the daily tasks in the maintenance virtualization scrum -- MU testing, automation validation, and new-member openQA learning (tests based on the IPMI backend). It slows us down a lot.

I hope you do not mind that I am moving the ticket back to Workable, and that you can help fix it with HIGH PRIORITY. Thanks a lot!

Actions #29

Updated by okurz over 2 years ago

  • Status changed from Workable to Feedback

@xlai can you please provide examples of failing tests? This ticket already has an "auto_review" regex with the ":retry" keyword in the subject line, so matching tests should be automatically labeled and retriggered (unless they are multi-machine tests), and the impact should be lower.

Actions #30

Updated by xlai over 2 years ago

okurz wrote:

@xlai can you please provide examples of failing tests? This ticket already has an "auto_review" regex with the ":retry" keyword in the subject line, so matching tests should be automatically labeled and retriggered (unless they are multi-machine tests), and the impact should be lower.

According to https://progress.opensuse.org/issues/99030#note-27, one example of a failing job is http://d488.qam.suse.de/tests/952.

@ph03nix Will the auto job retriggering mentioned by Oliver impact our MU testing? Do you agree that the impact is lower now?

Actions #31

Updated by mkittler over 2 years ago

Not sure how to work on the sub task, see #99111#note-1.

Actions #32

Updated by okurz over 2 years ago

Looking into our database for jobs with the reason "Failure while draining incoming flow" we see the following:

openqa=> select id,result_dir,t_finished from jobs where reason ~ 'Failure while draining incoming flow' order by t_finished;
   id    |                                                                         result_dir                                                                          |     t_finished
---------+-------------------------------------------------------------------------------------------------------------------------------------------------------------+---------------------
 5569780 | 05569780-sle-15-SP3-Online-jpupava-s390x-Build15sp3-xfstests_btrfs-btrfs-001-050@s390x-kvm-sle15                                                            | 2021-03-01 23:39:59
 5569808 | 05569808-sle-15-SP3-Online-jpupava-s390x-Build15sp3-xfstests_xfs-xfs-451-999@s390x-kvm-sle15                                                                | 2021-03-02 03:24:25
 5581434 | 05581434-sle-15-SP3-Online-jpupava-s390x-Build15sp3-xfstests_btrfs-generic-201-300@s390x-kvm-sle15                                                          | 2021-03-03 11:18:10
 5581447 | 05581447-sle-15-SP3-Online-jpupava-s390x-Build15sp3-xfstests_xfs-generic-301-400@s390x-kvm-sle15                                                            | 2021-03-03 11:24:31
 5616587 | 05616587-sle-15-SP3-Online-jpupava-s390x-Build15sp3-xfstests_xfs-dangrous-tests@s390x-kvm-sle15                                                             | 2021-03-07 02:24:54
 5618128 | 05618128-sle-15-SP3-Online-jpupava-s390x-Build15sp3-xfstests_btrfs-generic-001-100@s390x-kvm-sle15                                                          | 2021-03-07 05:53:35
 5618135 | 05618135-sle-15-SP3-Online-jpupava-s390x-Build15sp3-xfstests_xfs-dangrous-tests@s390x-kvm-sle15                                                             | 2021-03-07 06:54:31
 5618142 | 05618142-sle-15-SP3-Online-jpupava-s390x-Build15sp3-xfstests_btrfs-generic-001-100@s390x-kvm-sle15                                                          | 2021-03-07 07:35:34
 5618576 | 05618576-sle-15-SP3-Online-jpupava-s390x-Build15sp3-xfstests_xfs-dangrous-tests@s390x-kvm-sle15                                                             | 2021-03-07 20:54:12
 5624501 | 05624501-sle-15-SP3-Online-jpupava-s390x-Build15sp3-xfstests_xfs-dangrous-tests@s390x-kvm-sle15                                                             | 2021-03-09 12:37:32
 5667191 | 05667191-sle-15-SP3-Online-jpupava-s390x-Build15sp3-xfstests_xfs-dangrous-tests@s390x-kvm-sle15                                                             | 2021-03-14 13:20:47
 5909755 | 05909755-sle-15-SP3-Regression-on-Migration-from-SLE12-SPx-s390x-Build178.1-offline_sles12sp5_pscc_sdk-asmm-contm-lgm-tcm-wsm_all_full@s390x-kvm-sle12      | 2021-04-28 05:18:13
 5967063 | 05967063-sle-15-SP3-Regression-on-Migration-from-SLE12-SPx-s390x-Build183.1-offline_sles12sp4_ltss_pscc_sdk-asmm-contm-lgm-tcm-wsm_all_full@s390x-kvm-sle12 | 2021-05-07 06:49:39
 5991280 | 05991280-sle-15-SP3-Regression-on-Migration-from-SLE12-SPx-s390x-Build187.1-offline_sles12sp5_pscc_sdk-asmm-contm-lgm-tcm-wsm_all_full@s390x-kvm-sle12      | 2021-05-11 15:06:25
 6002891 | 06002891-sle-15-SP3-Regression-on-Migration-from-SLE12-SPx-s390x-Build187.1-offline_sles12sp4_ltss_pscc_sdk-asmm-contm-lgm-tcm-wsm_all_full@s390x-kvm-sle12 | 2021-05-12 05:40:48
 6974107 | 06974107-sle-15-SP4-Regression-on-Migration-from-SLE12-SPx-s390x-Build29.1-offline_sles12sp4_ltss_pscc_sdk-asmm-contm-lgm-tcm-wsm_all_full@s390x-kvm-sle12  | 2021-08-31 11:04:37
 7169955 | 07169955-sle-12-SP5-Server-DVD-Incidents-Kernel-KOTD-s390x-Build4.12.14-520.1.g9672a40-ltp_cve_git@s390x-kvm-sle12                                          | 2021-09-20 00:05:37
 7171999 | 07171999-sle-12-SP5-Server-DVD-Incidents-Kernel-KOTD-s390x-Build4.12.14-518.1.g0f8a8ca-kernel-live-patching@s390x-kvm-sle12                                 | 2021-09-20 09:13:19
 7172003 | 07172003-sle-12-SP5-Server-DVD-Incidents-Kernel-KOTD-s390x-Build4.12.14-518.1.g0f8a8ca-ltp_syscalls_debug_pagealloc@s390x-kvm-sle12                         | 2021-09-20 09:17:44
 7197569 | 07197569-sle-15-SP3-Server-DVD-Updates-s390x-Build20210922-2-mau-extratests1@s390x-kvm-sle12                                                                | 2021-09-22 21:02:23
 7200059 | 07200059-sle-15-SP3-Server-DVD-Updates-s390x-Build20210923-1-install_ltp+sle+Server-DVD-Updates@s390x-kvm-sle12                                             | 2021-09-23 04:50:06
 7207682 | 07207682-sle-15-SP1-Server-DVD-Updates-s390x-Build20210923-2-engines_and_tools_podman@s390x-kvm-sle12                                                       | 2021-09-23 21:51:28
 7211514 | 07211514-sle-15-SP1-Server-DVD-Incidents-Kernel-KOTD-s390x-Build4.12.14-76.1.g7ff24ce-install_ltp+sle+Server-DVD-Incidents-Kernel-KOTD@s390x-kvm-sle12      | 2021-09-24 00:14:42
 7217997 | 07217997-sle-15-SP4-Online-ppc64le-Build39.1-blktests_spvm@ppc64le-spvm                                                                                     | 2021-09-24 09:56:20
 7220091 | 07220091-sle-15-Server-DVD-Incidents-Kernel-KOTD-s390x-Build4.12.14-158.1.ga4b5f51-ltp_cve_git@s390x-kvm-sle12                                              | 2021-09-24 21:11:37
 7221423 | 07221423-sle-micro-5.1-MicroOS-Image-s390x-Build62.2_12.26-sle-micro_containers@s390x-kvm-sle12                                                             | 2021-09-25 00:49:56
 7261756 | 07261756-sle-15-SP4-Online-ppc64le-Build43.1-blktests_spvm@ppc64le-spvm                                                                                     | 2021-09-29 08:45:22
 7362484 | 07362484-sle-15-SP4-Online-ppc64le-Build47.1-blktests_spvm@ppc64le-spvm                                                                                     | 2021-10-09 16:56:25
 7485994 | 07485994-sle-15-SP4-Online-ppc64le-Build52.1-blktests_spvm@ppc64le-spvm                                                                                     | 2021-10-20 16:07:13
 7544533 | 07544533-sle-15-SP3-Container-Image-Updates-s390x-Build17.8.23-bci_on_SLES_15_host_docker@s390x-kvm-sle12                                                   | 2021-10-27 11:25:53
(30 rows)

So the specific error message could actually be seen already in 2021-03. There could have been many more jobs; it is just that the job group https://openqa.suse.de/admin/job_templates/293 is configured to keep results for a long time, which is why we still see such jobs. Then in 2021-09 there seem to have been multiple occurrences, but only three in 2021-10.

So this is certainly not a big impact. One of these tests is linked to https://bugzilla.suse.com/show_bug.cgi?id=1191116, so if the SUT runs into a kernel crash, that could explain a lost connection.

Who can provide more insight on the impact? Is the issue really that severe? Did it go away in the meantime?

Actions #33

Updated by xlai over 2 years ago

okurz wrote:

Looking into our database for jobs with the reason "Failure while draining incoming flow" we see the following:

So this is certainly not a big impact. One of these tests is linked to https://bugzilla.suse.com/show_bug.cgi?id=1191116, so if the SUT runs into a kernel crash, that could explain a lost connection.

@okurz, thanks for the efforts on it. In the above database search, did you also count openqa.qam.suse.cz? This is the major openQA instance for VT MU validation, and it has been hitting this issue much more than before since this month. But I am not sure whether the situation has changed now.

Who can provide more insight on the impact? Is the issue really that severe? Did it go away in the meantime?

@ph03nix @pdostal @tbaev Would you please provide your observations?

Actions #34

Updated by ph03nix over 2 years ago

okurz wrote:

Who can provide more insight on the impact? Is the issue really that severe? Did it go away in the meantime?

For the severity of the issue: This prevents us from doing virtualization testing on our development machines (which are in NUE). It does not impact our current runs on openqa.qam.suse.cz (which is in Prague).

I haven't had any time to check whether the issue has been resolved in the meantime; perhaps @tbaev can help us out here?
However, I don't share the view that this issue is related to a kernel crash, as it happens on a system without known issues of this kind (especially no maintenance updates included) and the same runs work fine on openqa.qam.suse.cz. So while this might be the reason for other failures of a similar kind, I don't believe (!!) it is linked to the issues we observe.

Actions #35

Updated by mkittler over 2 years ago

@xlai No, @okurz didn't count jobs running on openqa.qam.suse.cz. I've just tried doing the query on openqa.qam.suse.cz but I cannot log in via https://gitlab.suse.de/openqa/salt-pillars-openqa/-/blob/180aa368f6cc46755d42d4262167b6aa181feefc/sshd/users.sls#L70 as the system doesn't use our usual salt setup.

For investigating #99111 it would be good to know when the issue actually started and thus to be able to query the database on the host which is affected by it. We would need to query at which point the number of jobs incompleting with "Failure while draining incoming flow" increased significantly.

However, according to @ph03nix, the issue doesn't actually impact current runs on openqa.qam.suse.cz. So maybe it makes more sense to compare that host to e.g. d453.qam.suse.de where tests actually fail reproducibly (see #99111#note-6). In this case it is still a problem that I cannot log in on openqa.qam.suse.cz to find out what versions of packages are installed on that system (and compare them to d453.qam.suse.de). Possibly also some kind of configuration makes the difference.

Actions #36

Updated by xlai over 2 years ago

mkittler wrote:

@xlai No, @okurz didn't count jobs running on openqa.qam.suse.cz. I've just tried doing the query on openqa.qam.suse.cz but I cannot log in via https://gitlab.suse.de/openqa/salt-pillars-openqa/-/blob/180aa368f6cc46755d42d4262167b6aa181feefc/sshd/users.sls#L70 as the system doesn't use our usual salt setup.

For investigating #99111 it would be good to know when the issue actually started and thus to be able to query the database on the host which is affected by it. We would need to query at which point the number of jobs incompleting with "Failure while draining incoming flow" increased significantly.

However, according to @ph03nix, the issue doesn't actually impact current runs on openqa.qam.suse.cz. So maybe it makes more sense to compare that host to e.g. d453.qam.suse.de where tests actually fail reproducibly (see #99111#note-6). In this case it is still a problem that I cannot log in on openqa.qam.suse.cz to find out what versions of packages are installed on that system (and compare them to d453.qam.suse.de). Possibly also some kind of configuration makes the difference.

@tbaev Would you please help @mkittler with the login to openqa.qam.suse.cz? Just a reminder to @mkittler: this is the official openQA instance for maintenance update testing, so please be careful.

Thanks a lot for the efforts on the issue!

Actions #37

Updated by okurz over 2 years ago

I used openqa-query-for-job-label poo#99030 and found no matches over the period of the last 30 days. I created https://github.com/os-autoinst/scripts/pull/119 and with that I could run interval='90 day' openqa-query-for-job-label poo#99030 and got

7221423|2021-09-25 00:49:56|done|incomplete|sle-micro_containers|backend died: Lost SSH connection to SUT: Failure while draining incoming flow at /usr/lib/os-autoinst/consoles/ssh_screen.pm line 89.|grenache-1
7220091|2021-09-24 21:11:37|done|incomplete|ltp_cve_git|backend died: Lost SSH connection to SUT: Failure while draining incoming flow at /usr/lib/os-autoinst/consoles/ssh_screen.pm line 89.|grenache-1
7224329|2021-09-24 16:16:11|done|failed|qam-sles4sap_online_dvd_gnome_hana_nvdimm||grenache-1
7207682|2021-09-23 21:51:28|done|incomplete|engines_and_tools_podman|backend died: Lost SSH connection to SUT: Failure while draining incoming flow at /usr/lib/os-autoinst/consoles/ssh_screen.pm line 89.|grenache-1
7172003|2021-09-20 09:17:44|done|incomplete|ltp_syscalls_debug_pagealloc|backend died: Lost SSH connection to SUT: Failure while draining incoming flow at /usr/lib/os-autoinst/consoles/ssh_screen.pm line 89.|grenache-1
7171999|2021-09-20 09:13:19|done|incomplete|kernel-live-patching|backend died: Lost SSH connection to SUT: Failure while draining incoming flow at /usr/lib/os-autoinst/consoles/ssh_screen.pm line 89.|grenache-1

So the last time we saw this exact error seems to have been 2021-09-20 and not anymore since then. This applies to o3 and osd, where we use auto-review and have it called in job_done hooks. Anyone working on other instances could do the same to gather recent statistics.

Actions #38

Updated by xlai over 2 years ago

xlai wrote:

mkittler wrote:

@xlai No, @okurz didn't count jobs running on openqa.qam.suse.cz. I've just tried doing the query on openqa.qam.suse.cz but I cannot log in via https://gitlab.suse.de/openqa/salt-pillars-openqa/-/blob/180aa368f6cc46755d42d4262167b6aa181feefc/sshd/users.sls#L70 as the system doesn't use our usual salt setup.

For investigating #99111 it would be good to know when the issue actually started and thus to be able to query the database on the host which is affected by it. We would need to query at which point the number of jobs incompleting with "Failure while draining incoming flow" increased significantly.

However, according to @ph03nix, the issue doesn't actually impact current runs on openqa.qam.suse.cz. So maybe it makes more sense to compare that host to e.g. d453.qam.suse.de where tests actually fail reproducibly (see #99111#note-6). In this case it is still a problem that I cannot log in on openqa.qam.suse.cz to find out what versions of packages are installed on that system (and compare them to d453.qam.suse.de). Possibly also some kind of configuration makes the difference.

@tbaev Would you please help @mkittler with the login to openqa.qam.suse.cz? Just a reminder to @mkittler: this is the official openQA instance for maintenance update testing, so please be careful.

Thanks a lot for the efforts on the issue!

@mkittler I asked @tbaev for the login to openqa.qam.suse.cz and shared it with you via email. If anything else is needed, please let me know.

Actions #39

Updated by tbaev over 2 years ago

I have encountered the SSH error again at http://d453.qam.suse.de/tests/794, if this is helpful. The instance is up to date as of today.

Actions #40

Updated by mkittler over 2 years ago

I've logged into openqa.qam.suse.cz. The machine is still on Leap 15.2 (and the development VMs where the issue can currently be reproduced are already on Leap 15.3). That might make a difference. However, libssh2 has the same upstream version in both cases (1.9.0). The same goes for perl-Net-SSH2 (0.69¹). Perl is also the same (5.26.1).

I would conclude that recent updates of libssh2 or perl-Net-SSH2 did not make a difference here. They haven't even been updated recently. (The libssh2 version is from 2019-06-20 and the perl-Net-SSH2 version from 2018-2-24.)


¹ The current version would actually be 0.72 but it isn't installed on those systems because they don't use the version from devel:openQA due to the repository priorities. I would generally not recommend this kind of setup and would instead prefer using the devel:openQA repos with priorities, as the official documentation suggests.

Actions #41

Updated by okurz over 2 years ago

mkittler wrote:

The current version would actually be 0.72 but it isn't installed on those systems because they don't use the version from devel:openQA due to the repository priorities. I would generally not recommend this kind of setup and would instead prefer using the devel:openQA repos with priorities, as the official documentation suggests.

I agree. As this issue is not easily reproducible for us I suggest that you guys ensure that the affected machines are updated accordingly and see if the problem persists.

Actions #42

Updated by xlai over 2 years ago

okurz wrote:

mkittler wrote:

The current version would actually be 0.72 but it isn't installed on those systems because they don't use the version from devel:openQA due to the repository priorities. I would generally not recommend this kind of setup and would instead prefer using the devel:openQA repos with priorities, as the official documentation suggests.

I agree. As this issue is not easily reproducible for us I suggest that you guys ensure that the affected machines are updated accordingly and see if the problem persists.

@mkittler @okurz,
According to https://progress.opensuse.org/issues/99030#note-39, the answer should be yes.

From recent results, it seems the issue happens less often than when Felix reported it, only occasionally now. So we cannot provide an easy reproducer of the issue for you to debug either, sorry.

May I make a suggestion? Given the above fact, can the tools team do earlier preparation in the openQA code, so that when the issue happens again on whichever openQA server/IPMI SUT, you can find out from the job debug logs why, what is happening, and how to fix it? I guess it will not be easy, but I do not have other good suggestions now.

Actions #43

Updated by okurz over 2 years ago

xlai wrote:

can the tools team do earlier preparation in the openQA code […]

Sorry, I don't understand. What do you mean by that?

Actions #44

Updated by xlai over 2 years ago

okurz wrote:

xlai wrote:

can the tools team do earlier preparation in the openQA code […]

Sorry, I don't understand. What do you mean by that?

I am suggesting that, if possible, debug code for this specific issue could be added in the relevant openQA code paths, so that once the issue happens again, you can find out from the job's debug log what is happening and maybe figure out a way to fix it. Then there would be no need to struggle with reproducing the problem locally in your environment. IMHO, this issue is triggered by a specific network condition, which is not easy to simulate.
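
A hedged sketch of what such instrumentation could look like, assuming os-autoinst's bmwqemu::diag logger is available and the console object keeps its Net::SSH2 session in $self->{ssh}; the hostname attribute and the byte counter are hypothetical:

use strict;
use warnings;

sub log_ssh_failure {
    my ($self, $bytes_drained) = @_;
    # error() in list context gives the numeric code, symbolic name and message.
    my ($errcode, $errname, $errstr) = $self->{ssh}->error;
    bmwqemu::diag(sprintf 'SSH drain failure on %s: code=%d name=%s msg=%s (%d bytes drained so far)',
        $self->{hostname} // 'unknown host', $errcode, $errname, $errstr, $bytes_drained // 0);
}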

Actions #45

Updated by livdywan about 2 years ago

Being an epic, this ticket has no priority, but existing subtasks were High and we should define new ones at the next opportunity

Actions #46

Updated by livdywan almost 2 years ago

cdywan wrote:

Being an epic, this ticket has no priority, but existing subtasks were High and we should define new ones at the next opportunity

Little reminder, this epic still has no sub tasks and is effectively High priority

Actions #47

Updated by okurz almost 2 years ago

  • Status changed from Feedback to Resolved
  • Parent task set to #109668

I now called interval='300 day' openqa-query-for-job-label poo#99030 and only found a single entry

7172003|2021-09-20 09:17:44|done|incomplete|ltp_syscalls_debug_pagealloc|backend died: Lost SSH connection to SUT: Failure while draining incoming flow at /usr/lib/os-autoinst/consoles/ssh_screen.pm line 89.|grenache-1

So no further occurrences since then. As there were also no other mentions, I assume that OS or package upgrades might have helped to resolve the issue.
