action #179404
coordination #176337: [saga][epic] Stable os-autoinst backends with stable command execution (no mistyping)
coordination #125708: [epic] Future ideas for more stable non-qemu backends
[sporadic] svirt s390x tests sometimes time out while syncing assets auto_review:"LIBSSH2_ERROR_TIMEOUT[\s\S]*rsync":retry
0%
Description
The problem
svirt s390x tests sometimes time out while syncing assets. These tests now simply time out with a clear error message pointing at the specific step that took too long. The timeout can also be overridden by setting the test variable SVIRT_ASSET_DOWNLOAD_TIMEOUT_M. We still don't know why the sync sometimes takes too long, or whether it is really just very slow (so retrying doesn't help anyway) or completely stuck (so retrying might actually help).
Original observation
After #176868 was resolved, the timeout for syncing assets is enforced correctly when it expires:
[2025-03-04T01:57:33.064193Z] [debug] [pid:34725] Using existing SSH connection (key:hostname=s390zl12.oqa.prg2.suse.org,username=root,port=22)
[2025-03-04T02:12:35.031481Z] [debug] [pid:34725] [run_ssh_cmd(rsync --timeout='150' --stats -av '/var/lib/openqa/share/factory/hdd/sle-15-SP3-s390x-5.3.18-150300.268.1.gd2bdf5f-Server-DVD-Incidents-Kernel-KOTD@s390x-kvm-with-ltp.qcow2' '/var/lib/libvirt/images//sle-15-SP3-s390x-5.3.18-150300.268.1.gd2bdf5f-Server-DVD-Incidents-Kernel-KOTD@s390x-kvm-with-ltp.qcow2')] stdout:
sending incremental file list
sle-15-SP3-s390x-5.3.18-150300.268.1.gd2bdf5f-Server-DVD-Incidents-Kernel-KOTD@s390x-kvm-with-ltp.qcow2
…
Time out waiting for data (-9 LIBSSH2_ERROR_TIMEOUT) at /usr/lib/perl5/vendor_perl/5.26.1/x86_64-linux-thread-multi/Net/SSH2.pm line 51.
Net::SSH2::die_with_error(Net::SSH2=SCALAR(0x562b7a3cd8c8)) called at /usr/lib/os-autoinst/backend/baseclass.pm line 1328
backend::baseclass::run_ssh_cmd(backend::svirt=HASH(0x562b7a84d618), "rsync --timeout='150' --stats -av '/var/lib/openqa/share/fact"..., "username", "root", "hostname", "s390zl12.oqa.prg2.suse.org", "password", "Nots3cr3t-\@3-vt", ...) called at /usr/lib/os-autoinst/consoles/sshVirtsh.pm line 674
consoles::sshVirtsh::run_cmd(consoles::sshVirtsh=HASH(0x562b79c875c8), "rsync --timeout='150' --stats -av '/var/lib/openqa/share/fact"..., "timeout", 900) called at /usr/lib/os-autoinst/consoles/sshVirtsh.pm line 396
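As a quick sanity check that the timeout is now enforced, the two debug timestamps in the log above can be compared with the timeout of 900 visible in the run_cmd stack frame. The sketch below assumes that the rsync command started right after the "Using existing SSH connection" line and that the timeout value is in seconds; neither is stated explicitly in the log.

```python
from datetime import datetime

# Timestamps copied from the two debug lines in the log excerpt.
FMT = "%Y-%m-%dT%H:%M:%S.%fZ"
start = datetime.strptime("2025-03-04T01:57:33.064193Z", FMT)
end = datetime.strptime("2025-03-04T02:12:35.031481Z", FMT)

elapsed_s = (end - start).total_seconds()
timeout_s = 900  # "timeout" argument from the run_cmd frame (assumed to be seconds)

print(f"elapsed: {elapsed_s:.1f} s, over timeout by {elapsed_s - timeout_s:.1f} s")
```

The command was aborted roughly 902 seconds after the connection was picked up, i.e. about two seconds after the configured timeout.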
The underlying problem, that this rsync command can take very long (or even get stuck), hasn't been resolved, though. The problem affected multiple jobs (see #176868#note-17) but more recent jobs look good again.
We need to figure out whether the download is really that slow or whether the SSH connection is going stale for some reason. In the latter case it would perhaps help to retry the download with a fresh connection.
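One way to tell "slow" from "stuck" would be to sample the number of bytes transferred at intervals and check whether it still grows. The following is only a sketch of that idea: classify_transfer and its sample format are hypothetical and not part of os-autoinst, and it assumes the byte counts can be obtained somehow (for example by checking the size of the destination file over a separate connection). The default window of 150 seconds merely mirrors the rsync --timeout value seen in the log.

```python
from typing import List, Tuple


def classify_transfer(samples: List[Tuple[float, int]],
                      stall_window_s: float = 150.0) -> str:
    """Classify a transfer from (elapsed_seconds, bytes_transferred) samples.

    Samples must be sorted by time. Returns "stuck" if the byte count did
    not grow over the last stall_window_s seconds, "slow" if it did, and
    "unknown" if there is not enough history to judge.
    """
    if not samples:
        return "unknown"
    last_t, last_bytes = samples[-1]
    # Byte counts observed at least one full window before the latest sample.
    older = [b for t, b in samples if t <= last_t - stall_window_s]
    if not older:
        return "unknown"
    return "stuck" if older[-1] >= last_bytes else "slow"


# Progress stops after 120 s: no growth during the final window.
print(classify_transfer([(0, 0), (60, 500), (120, 1000), (300, 1000), (900, 1000)]))
# Progress continues until the end, just slowly.
print(classify_transfer([(0, 0), (300, 1000), (600, 2000), (900, 3000)]))
```

A "stuck" result would support retrying with a fresh connection, while "slow" would suggest that only a longer timeout or a faster transfer path helps.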
Ideas for improvement
In #178324 a retry has been implemented but we learned that a trivial implementation is not sufficient as it leads to leftover processes. The retry mechanism is still in place but reduced to just one attempt to mitigate the problem of leftover processes. One idea to avoid the problem of leftover processes was https://github.com/os-autoinst/os-autoinst/pull/2682.
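To illustrate the shape of a retry that opens a fresh connection and cleans up after a failed attempt, here is a minimal sketch. It is not the implementation from #178324 or from the pull request above; connect, run, cleanup and TransferTimeout are hypothetical placeholders for "open a new SSH connection", "run the rsync command over it" and "remove leftover processes of the timed-out attempt".

```python
from typing import Callable, Optional, TypeVar

T = TypeVar("T")


class TransferTimeout(Exception):
    """Raised by the (hypothetical) run callable when the transfer times out."""


def run_with_fresh_connection(connect: Callable[[], object],
                              run: Callable[[object], T],
                              cleanup: Callable[[], None],
                              attempts: int = 2) -> T:
    """Run a transfer, opening a new connection for every attempt."""
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    last_error: Optional[TransferTimeout] = None
    for _ in range(attempts):
        conn = connect()  # never reuse a possibly stale connection
        try:
            return run(conn)
        except TransferTimeout as err:
            last_error = err
            # Remove leftovers of the failed attempt, also after the last
            # attempt, so that nothing keeps running in the background.
            cleanup()
    assert last_error is not None
    raise last_error
```

The important design point is that cleanup runs after every timed-out attempt; without it, each retry could leave another process behind, which is the problem described above.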
Note that printing additional statistics is also not straightforward, see #178324#note-7. I also tried using timeout with various signals (instead of using --stop-after=1) but it all led to the same result of statistics not being printed.
Steps to reproduce
Find jobs referencing this ticket with the help of https://raw.githubusercontent.com/os-autoinst/scripts/master/openqa-query-for-job-label, call openqa-query-for-job-label poo#178324
Updated by mkittler 7 days ago
- Copied from action #178324: [sporadic] svirt s390x tests sometimes time out while syncing assets auto_review:"LIBSSH2_ERROR_TIMEOUT[\s\S]*rsync":retry size:S added
Updated by livdywan 7 days ago
- Subject changed from [sporadic] svirt s390x tests sometimes time out while syncing assets auto_review:"LIBSSH2_ERROR_TIMEOUT[\s\S]*rsync":retry size:S to [sporadic] svirt s390x tests sometimes time out while syncing assets auto_review:"LIBSSH2_ERROR_TIMEOUT[\s\S]*rsync":retry