action #178324
closedcoordination #176337: [saga][epic] Stable os-autoinst backends with stable command execution (no mistyping)
coordination #125708: [epic] Future ideas for more stable non-qemu backends
[sporadic] svirt s390x tests sometimes time out while syncing assets auto_review:"LIBSSH2_ERROR_TIMEOUT[\s\S]*rsync":retry size:S
Description
Observation
Since #176868 was resolved, the timeout for syncing assets is enforced correctly when it expires:
[2025-03-04T01:57:33.064193Z] [debug] [pid:34725] Using existing SSH connection (key:hostname=s390zl12.oqa.prg2.suse.org,username=root,port=22)
[2025-03-04T02:12:35.031481Z] [debug] [pid:34725] [run_ssh_cmd(rsync --timeout='150' --stats -av '/var/lib/openqa/share/factory/hdd/sle-15-SP3-s390x-5.3.18-150300.268.1.gd2bdf5f-Server-DVD-Incidents-Kernel-KOTD@s390x-kvm-with-ltp.qcow2' '/var/lib/libvirt/images//sle-15-SP3-s390x-5.3.18-150300.268.1.gd2bdf5f-Server-DVD-Incidents-Kernel-KOTD@s390x-kvm-with-ltp.qcow2')] stdout:
sending incremental file list
sle-15-SP3-s390x-5.3.18-150300.268.1.gd2bdf5f-Server-DVD-Incidents-Kernel-KOTD@s390x-kvm-with-ltp.qcow2
…
Time out waiting for data (-9 LIBSSH2_ERROR_TIMEOUT) at /usr/lib/perl5/vendor_perl/5.26.1/x86_64-linux-thread-multi/Net/SSH2.pm line 51.
Net::SSH2::die_with_error(Net::SSH2=SCALAR(0x562b7a3cd8c8)) called at /usr/lib/os-autoinst/backend/baseclass.pm line 1328
backend::baseclass::run_ssh_cmd(backend::svirt=HASH(0x562b7a84d618), "rsync --timeout='150' --stats -av '/var/lib/openqa/share/fact"..., "username", "root", "hostname", "s390zl12.oqa.prg2.suse.org", "password", "Nots3cr3t-\@3-vt", ...) called at /usr/lib/os-autoinst/consoles/sshVirtsh.pm line 674
consoles::sshVirtsh::run_cmd(consoles::sshVirtsh=HASH(0x562b79c875c8), "rsync --timeout='150' --stats -av '/var/lib/openqa/share/fact"..., "timeout", 900) called at /usr/lib/os-autoinst/consoles/sshVirtsh.pm line 396
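As a quick sanity check of the excerpt above, the two log timestamps can be subtracted to see how long the command ran before the error was raised. This is only a sketch: it assumes the "Using existing SSH connection" line marks the start of the rsync call, and the helper name elapsed_seconds is made up for illustration.

```python
from datetime import datetime

# Timestamp layout used in the log excerpt, e.g. 2025-03-04T01:57:33.064193Z
LOG_TS_FORMAT = "%Y-%m-%dT%H:%M:%S.%fZ"


def elapsed_seconds(start: str, end: str) -> float:
    """Return the number of seconds between two log timestamps."""
    t0 = datetime.strptime(start, LOG_TS_FORMAT)
    t1 = datetime.strptime(end, LOG_TS_FORMAT)
    return (t1 - t0).total_seconds()


# Timestamps copied from the two [debug] lines in the excerpt.
elapsed = elapsed_seconds(
    "2025-03-04T01:57:33.064193Z",
    "2025-03-04T02:12:35.031481Z",
)
print(f"{elapsed:.1f} s")
```

The result is just over 900 seconds, which is consistent with the "timeout", 900 argument visible in the run_cmd frame of the stack trace.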
The underlying problem, that this rsync command can take very long (or even get stuck), hasn't been resolved, though. The problem affected multiple jobs (see #176868#note-17), but more recent jobs look good again.
We need to figure out whether the download is really that slow or whether the SSH connection is going stale for some reason. In the latter case it might help to retry the download with a fresh connection.
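The retry idea could look roughly like the following. This is a language-agnostic sketch written in Python, not os-autoinst code (which is Perl): TransferTimeout, run_with_fresh_connection, connect and transfer are all hypothetical names, and the only point illustrated is that each attempt opens a new connection instead of reusing a cached one.

```python
import time
from typing import Any, Callable, Optional


class TransferTimeout(Exception):
    """Hypothetical stand-in for an SSH-level timeout such as LIBSSH2_ERROR_TIMEOUT."""


def run_with_fresh_connection(
    connect: Callable[[], Any],
    transfer: Callable[[Any], Any],
    attempts: int = 3,
    delay_s: float = 0.0,
) -> Any:
    """Run transfer(conn), opening a brand-new connection for every attempt.

    connect() must return an object with a close() method; transfer(conn)
    raises TransferTimeout when the transfer stalls. Both are assumptions
    made for this sketch.
    """
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    last_error: Optional[TransferTimeout] = None
    for attempt in range(1, attempts + 1):
        conn = connect()  # never reuse a possibly stale connection
        try:
            return transfer(conn)
        except TransferTimeout as err:
            last_error = err
            if attempt < attempts:
                time.sleep(delay_s)  # optional pause before the next attempt
        finally:
            conn.close()  # always drop the connection used for this attempt
    assert last_error is not None
    raise last_error
```

If a retry on a fresh connection succeeds quickly where the first attempt timed out, that would point towards a stale connection rather than a genuinely slow download.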
Steps to reproduce
Find jobs referencing this ticket with the help of
https://raw.githubusercontent.com/os-autoinst/scripts/master/openqa-query-for-job-label :
call openqa-query-for-job-label poo#178324
Suggestions
- Look into https://openqa.suse.de/tests/16929919#next_previous and consider cloning the latest job, e.g. 100 times
- Maybe it helps to increase the timeout from 15 minutes to e.g. 25 minutes by setting SVIRT_ASSET_DOWNLOAD_TIMEOUT_M or changing the default in os-autoinst. Note that before this timeout was enforced, jobs typically ran into the overall job timeout (see #176076), so this is actually unlikely to help.
- Check out https://openqa.suse.de/tests/overview?modules=bootloader_zkvm&modules_result=failed for problematic jobs (not all of them are about the same issue!)
- Adjust autoreview regex as needed
- Try adding debug flags to rsync and store its output in a file other than os-autoinst-log.txt
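Before adjusting the auto_review expression from the ticket title, it can be tried against a log excerpt. The sketch below assumes the expression is applied as a regex search over the whole log text; Python's re module is used here only as a stand-in for whatever engine the auto-review tooling actually uses, and the name matches_auto_review is made up.

```python
import re

# Expression taken from the auto_review part of the ticket title.
AUTO_REVIEW = re.compile(r"LIBSSH2_ERROR_TIMEOUT[\s\S]*rsync")

# Lines copied from the log excerpt in the Observation section.
LOG_EXCERPT = """\
sending incremental file list
Time out waiting for data (-9 LIBSSH2_ERROR_TIMEOUT) at /usr/lib/perl5/vendor_perl/5.26.1/x86_64-linux-thread-multi/Net/SSH2.pm line 51.
consoles::sshVirtsh::run_cmd(consoles::sshVirtsh=HASH(0x562b79c875c8), "rsync --timeout='150' --stats -av '/var/lib/openqa/share/fact"..., "timeout", 900) called at /usr/lib/os-autoinst/consoles/sshVirtsh.pm line 396
"""


def matches_auto_review(log_text: str) -> bool:
    """Return True if the auto_review expression is found anywhere in log_text."""
    return AUTO_REVIEW.search(log_text) is not None


print(matches_auto_review(LOG_EXCERPT))
```

Note that the expression requires "rsync" to appear after LIBSSH2_ERROR_TIMEOUT. In the excerpt this is satisfied only by the stack trace frames that quote the command, so a log where the trace is missing or truncated would not match.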