action #179404

open

coordination #176337: [saga][epic] Stable os-autoinst backends with stable command execution (no mistyping)

coordination #125708: [epic] Future ideas for more stable non-qemu backends

[sporadic] svirt s390x tests sometimes time out while syncing assets auto_review:"LIBSSH2_ERROR_TIMEOUT[\s\S]*rsync":retry

Added by mkittler 7 days ago. Updated 7 days ago.

Status: New
Priority: Normal
Assignee: -
Category: Feature requests
Target version:
Start date: 2025-02-10
Due date:
% Done: 0%
Estimated time:
Description

The problem

svirt s390x tests sometimes time out while syncing assets. These tests now time out with a clear error message naming the specific step that took too long. The timeout can be overridden by setting the test variable SVIRT_ASSET_DOWNLOAD_TIMEOUT_M. We still don't know why the sync sometimes takes too long, nor whether it is just very slow (so retrying wouldn't help anyway) or completely stuck (so retrying might actually help).

Original observation

After #176868 was resolved, the timeout for syncing assets is enforced correctly when it expires:

[2025-03-04T01:57:33.064193Z] [debug] [pid:34725] Using existing SSH connection (key:hostname=s390zl12.oqa.prg2.suse.org,username=root,port=22)
[2025-03-04T02:12:35.031481Z] [debug] [pid:34725] [run_ssh_cmd(rsync --timeout='150' --stats -av '/var/lib/openqa/share/factory/hdd/sle-15-SP3-s390x-5.3.18-150300.268.1.gd2bdf5f-Server-DVD-Incidents-Kernel-KOTD@s390x-kvm-with-ltp.qcow2' '/var/lib/libvirt/images//sle-15-SP3-s390x-5.3.18-150300.268.1.gd2bdf5f-Server-DVD-Incidents-Kernel-KOTD@s390x-kvm-with-ltp.qcow2')] stdout:
  sending incremental file list
  sle-15-SP3-s390x-5.3.18-150300.268.1.gd2bdf5f-Server-DVD-Incidents-Kernel-KOTD@s390x-kvm-with-ltp.qcow2
…
  Time out waiting for data (-9 LIBSSH2_ERROR_TIMEOUT) at /usr/lib/perl5/vendor_perl/5.26.1/x86_64-linux-thread-multi/Net/SSH2.pm line 51.
    Net::SSH2::die_with_error(Net::SSH2=SCALAR(0x562b7a3cd8c8)) called at /usr/lib/os-autoinst/backend/baseclass.pm line 1328
    backend::baseclass::run_ssh_cmd(backend::svirt=HASH(0x562b7a84d618), "rsync --timeout='150' --stats -av '/var/lib/openqa/share/fact"..., "username", "root", "hostname", "s390zl12.oqa.prg2.suse.org", "password", "Nots3cr3t-\@3-vt", ...) called at /usr/lib/os-autoinst/consoles/sshVirtsh.pm line 674
    consoles::sshVirtsh::run_cmd(consoles::sshVirtsh=HASH(0x562b79c875c8), "rsync --timeout='150' --stats -av '/var/lib/openqa/share/fact"..., "timeout", 900) called at /usr/lib/os-autoinst/consoles/sshVirtsh.pm line 396
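
As a quick sanity check that the timeout really fired on schedule, the elapsed time between the two debug lines can be compared with the timeout of 900 passed to run_cmd in the backtrace. This is only a sketch: it assumes that value is in seconds and that the "Using existing SSH connection" line marks roughly when the rsync command was started.

```python
from datetime import datetime

# Timestamps copied from the two [debug] lines in the log excerpt above.
START = "2025-03-04T01:57:33.064193Z"
END = "2025-03-04T02:12:35.031481Z"

# Timeout passed to run_cmd according to the backtrace (assumed to be seconds).
RUN_CMD_TIMEOUT_S = 900


def parse(ts: str) -> datetime:
    # Drop the trailing "Z"; both values are UTC, so a naive difference is fine.
    return datetime.strptime(ts.rstrip("Z"), "%Y-%m-%dT%H:%M:%S.%f")


elapsed_s = (parse(END) - parse(START)).total_seconds()
overshoot_s = elapsed_s - RUN_CMD_TIMEOUT_S

print(f"elapsed: {elapsed_s:.3f} s, overshoot: {overshoot_s:.3f} s")
```

The command was therefore aborted about two seconds after the 900 s mark, which is consistent with the timeout being enforced rather than the job hanging indefinitely.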

The underlying problem, that this rsync command can take very long (or even get stuck), hasn't been resolved, though. The problem affected multiple jobs (see #176868#note-17), but more recent jobs look good again.

We need to figure out whether the download is really that slow or whether the SSH connection is going stale for some reason. In the latter case, it might help to retry the download with a fresh connection.
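
One way to tell "slow" from "stuck" would be to sample the size of the destination file while the sync runs and check whether it still grows. The sketch below is hypothetical: classify_transfer, its sample format and the 150 s window (chosen only to mirror the rsync --timeout value in the log) are assumptions for illustration, not part of os-autoinst.

```python
from typing import List, Tuple


def classify_transfer(
    samples: List[Tuple[float, int]], stall_window_s: float = 150.0
) -> str:
    """Classify a transfer from (seconds_since_start, bytes_on_disk) samples.

    Returns "stuck" if the size has not grown for at least stall_window_s
    at the end of the samples, "progressing" if it has grown overall and
    did so within that window, and "unknown" otherwise.
    """
    if len(samples) < 2:
        return "unknown"

    last_t, last_size = samples[-1]

    # Walk backwards to find the earliest sample that already had the final size.
    first_t_at_final_size = last_t
    for t, size in reversed(samples):
        if size < last_size:
            break
        first_t_at_final_size = t

    if last_t - first_t_at_final_size >= stall_window_s:
        return "stuck"
    if last_size > samples[0][1]:
        return "progressing"
    return "unknown"


if __name__ == "__main__":
    slow = [(0, 0), (60, 100), (120, 200), (180, 300)]
    stale = [(0, 0), (60, 100), (120, 100), (180, 100), (240, 100)]
    print(classify_transfer(slow), classify_transfer(stale))
```

A "stuck" result would point towards a stale connection, where a retry with a fresh connection has a chance of helping. A "progressing" result at the time of the timeout would suggest that only a longer timeout helps.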

Ideas for improvement

In #178324 a retry has been implemented, but we learned that a trivial implementation is not sufficient as it leads to leftover processes. The retry mechanism is still in place but reduced to just one attempt to mitigate the problem of leftover processes. One idea to avoid the problem of leftover processes was https://github.com/os-autoinst/os-autoinst/pull/2682.
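
To illustrate why a naive retry can leave processes behind, here is a minimal local analogue (not the approach taken in the pull request above). It assumes a POSIX system, and run_with_retry is a hypothetical helper. The point is that the timed-out attempt is killed together with its whole process group and reaped before the next attempt starts. It does not cover processes left behind on a remote host at the other end of an SSH connection.

```python
import os
import signal
import subprocess
from typing import List


def run_with_retry(cmd: List[str], timeout_s: float, attempts: int = 2) -> int:
    """Run cmd and return its exit code, retrying when it times out.

    Each attempt gets its own session (and thus its own process group), so
    a timed-out attempt can be killed together with any children it spawned.
    """
    for _ in range(attempts):
        # start_new_session=True makes the child a group leader (POSIX only),
        # so its process group ID equals its PID.
        proc = subprocess.Popen(cmd, start_new_session=True)
        try:
            return proc.wait(timeout=timeout_s)
        except subprocess.TimeoutExpired:
            try:
                # Kill the whole group, not just the direct child.
                os.killpg(proc.pid, signal.SIGKILL)
            except ProcessLookupError:
                pass  # The group vanished between the timeout and the kill.
            proc.wait()  # Reap the child so no zombie is left behind.
    raise TimeoutError(f"{cmd!r} timed out in {attempts} attempt(s)")
```

Without the group kill and the final wait, every timed-out attempt could keep running in the background while the next one starts, which is the kind of pile-up a trivial retry causes.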

Note that printing additional statistics is also not straightforward, see #178324#note-7. I also tried using timeout with various signals (instead of using --stop-after=1), but it all led to the same result of statistics not being printed.

Steps to reproduce

Find jobs referencing this ticket with the help of the script at
https://raw.githubusercontent.com/os-autoinst/scripts/master/openqa-query-for-job-label
by calling openqa-query-for-job-label poo#178324
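
For context on which jobs end up carrying this label: the auto_review expression in the subject has to match the job's failure output. The sketch below checks that expression against shortened lines from the log excerpt in this ticket. Python's re module is used purely for illustration, and the actual labeling scripts may use a different regex engine. matches_auto_review is a hypothetical helper name.

```python
import re

# Expression taken from the auto_review part of the ticket subject.
AUTO_REVIEW = r"LIBSSH2_ERROR_TIMEOUT[\s\S]*rsync"

# Lines copied (and shortened at the end) from the log excerpt in this ticket.
LOG_EXCERPT = r"""
  Time out waiting for data (-9 LIBSSH2_ERROR_TIMEOUT) at /usr/lib/perl5/vendor_perl/5.26.1/x86_64-linux-thread-multi/Net/SSH2.pm line 51.
    Net::SSH2::die_with_error(Net::SSH2=SCALAR(0x562b7a3cd8c8)) called at /usr/lib/os-autoinst/backend/baseclass.pm line 1328
    backend::baseclass::run_ssh_cmd(backend::svirt=HASH(0x562b7a84d618), "rsync --timeout='150' --stats -av '/var/lib/openqa/share/fact"...
"""


def matches_auto_review(text: str) -> bool:
    # [\s\S]* also spans newlines, so the error and "rsync" may be on different lines.
    return re.search(AUTO_REVIEW, text) is not None


if __name__ == "__main__":
    print(matches_auto_review(LOG_EXCERPT))
    # A plain ".*" would not cross the line break between the error and "rsync".
    print(re.search(r"LIBSSH2_ERROR_TIMEOUT.*rsync", LOG_EXCERPT) is not None)
```

The [\s\S]* part matters because the LIBSSH2 error and the rsync command appear on different lines of the backtrace.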


Related issues 1 (0 open, 1 closed)

Copied from openQA Project (public) - action #178324: [sporadic] svirt s390x tests sometimes time out while syncing assets auto_review:"LIBSSH2_ERROR_TIMEOUT[\s\S]*rsync":retry size:S (Resolved, mkittler, 2025-02-10)

#1

Updated by mkittler 7 days ago

  • Copied from action #178324: [sporadic] svirt s390x tests sometimes time out while syncing assets auto_review:"LIBSSH2_ERROR_TIMEOUT[\s\S]*rsync":retry size:S added

#2

Updated by livdywan 7 days ago

  • Due date deleted (2025-03-28)

#3

Updated by livdywan 7 days ago

  • Subject changed from [sporadic] svirt s390x tests sometimes time out while syncing assets auto_review:"LIBSSH2_ERROR_TIMEOUT[\s\S]*rsync":retry size:S to [sporadic] svirt s390x tests sometimes time out while syncing assets auto_review:"LIBSSH2_ERROR_TIMEOUT[\s\S]*rsync":retry

#4

Updated by okurz 7 days ago

  • Target version set to future