action #178324
coordination #176337 (closed): [saga][epic] Stable os-autoinst backends with stable command execution (no mistyping)
coordination #125708: [epic] Future ideas for more stable non-qemu backends
[sporadic] svirt s390x tests sometimes time out while syncing assets auto_review:"LIBSSH2_ERROR_TIMEOUT[\s\S]*rsync":retry size:S
Description
Observation
After #176868 was resolved, the timeout for syncing assets is enforced correctly when it expires:
[2025-03-04T01:57:33.064193Z] [debug] [pid:34725] Using existing SSH connection (key:hostname=s390zl12.oqa.prg2.suse.org,username=root,port=22)
[2025-03-04T02:12:35.031481Z] [debug] [pid:34725] [run_ssh_cmd(rsync --timeout='150' --stats -av '/var/lib/openqa/share/factory/hdd/sle-15-SP3-s390x-5.3.18-150300.268.1.gd2bdf5f-Server-DVD-Incidents-Kernel-KOTD@s390x-kvm-with-ltp.qcow2' '/var/lib/libvirt/images//sle-15-SP3-s390x-5.3.18-150300.268.1.gd2bdf5f-Server-DVD-Incidents-Kernel-KOTD@s390x-kvm-with-ltp.qcow2')] stdout:
sending incremental file list
sle-15-SP3-s390x-5.3.18-150300.268.1.gd2bdf5f-Server-DVD-Incidents-Kernel-KOTD@s390x-kvm-with-ltp.qcow2
…
Time out waiting for data (-9 LIBSSH2_ERROR_TIMEOUT) at /usr/lib/perl5/vendor_perl/5.26.1/x86_64-linux-thread-multi/Net/SSH2.pm line 51.
Net::SSH2::die_with_error(Net::SSH2=SCALAR(0x562b7a3cd8c8)) called at /usr/lib/os-autoinst/backend/baseclass.pm line 1328
backend::baseclass::run_ssh_cmd(backend::svirt=HASH(0x562b7a84d618), "rsync --timeout='150' --stats -av '/var/lib/openqa/share/fact"..., "username", "root", "hostname", "s390zl12.oqa.prg2.suse.org", "password", "Nots3cr3t-\@3-vt", ...) called at /usr/lib/os-autoinst/consoles/sshVirtsh.pm line 674
consoles::sshVirtsh::run_cmd(consoles::sshVirtsh=HASH(0x562b79c875c8), "rsync --timeout='150' --stats -av '/var/lib/openqa/share/fact"..., "timeout", 900) called at /usr/lib/os-autoinst/consoles/sshVirtsh.pm line 396
The underlying problem that this rsync command can take a very long time (or even gets stuck) hasn't been resolved, though. The problem affected multiple jobs (see #176868#note-17), but more recent jobs look good again.
We need to figure out whether the download is really that slow or whether the SSH connection goes stale for some reason. In the latter case it would perhaps help to retry the download with a fresh connection.
Steps to reproduce
Find jobs referencing this ticket with the help of
https://raw.githubusercontent.com/os-autoinst/scripts/master/openqa-query-for-job-label,
e.g. call openqa-query-for-job-label poo#178324
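A minimal usage sketch (assuming curl is available and the script can reach the target openQA instance; the download location is arbitrary):
$ curl -s -o openqa-query-for-job-label https://raw.githubusercontent.com/os-autoinst/scripts/master/openqa-query-for-job-label
$ chmod +x openqa-query-for-job-label
$ ./openqa-query-for-job-label poo#178324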
Suggestions
- Look into https://openqa.suse.de/tests/16929919#next_previous and consider cloning e.g. 100 jobs of the latest job
- Maybe it helps to increase the timeout from 15 minutes to e.g. 25 minutes by setting SVIRT_ASSET_DOWNLOAD_TIMEOUT_M or changing the default in os-autoinst (a usage sketch follows below this list). Note that before this timeout was enforced, jobs typically ran into the overall job timeout (see #176076), so this is actually unlikely to help.
- Check out https://openqa.suse.de/tests/overview?modules=bootloader_zkvm&modules_result=failed for problematic jobs (not all of them are about the same issue!)
- Adjust the auto_review regex as needed
- Try adding debug flags to rsync and store the output in a file other than os-autoinst-log.txt
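A usage sketch for the timeout suggestion above (assuming openqa-clone-job is set up for openqa.suse.de, that SVIRT_ASSET_DOWNLOAD_TIMEOUT_M is given in minutes as the name suggests, and reusing the example job ID from the first suggestion; the BUILD value is just an arbitrary marker):
$ openqa-clone-job --within-instance https://openqa.suse.de 16929919 SVIRT_ASSET_DOWNLOAD_TIMEOUT_M=25 _GROUP=0 BUILD=poo178324-timeout-test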
Updated by mkittler about 2 months ago
- Copied from action #176868: Random SSH failures on s390x workers size:S added
Updated by livdywan about 2 months ago · Edited
https://openqa.suse.de/tests/16727691#next_previous shows the last occurrence was 22 days ago. Do we know about other cases? Can we use a regex to find those, i.e. Time out waiting for data (-9 LIBSSH2_ERROR_TIMEOUT)?
Otherwise it might be okay to move to Next.
Updated by MDoucha about 2 months ago
livdywan wrote in #note-3:
https://openqa.suse.de/tests/16727691#next_previous shows the last occurrence was 22 days ago. Do we know about other cases? Can we use a regex to find those?
Otherwise it might be okay to move to Next.
We have two separate flavors where s390x boots fail and the one above gets scheduled once a month. The other gets new jobs nearly every day. See poo#176868 for the most recent example failures.
Updated by livdywan about 2 months ago
We have two separate flavors where s390x boots fail and the one above gets scheduled once a month. The other gets new jobs nearly every day. See poo#176868 for the most recent example failures.
Ah. Somehow missed those. Thanks. https://openqa.suse.de/tests/16929919#next_previous looks closer to every other day then.
Updated by okurz about 2 months ago
- Subject changed from svirt s390x tests sometimes time out while syncing assets to [sporadic] svirt s390x tests sometimes time out while syncing assets auto_review:"LIBSSH2_ERROR_TIMEOUT[\s\S]*rsync":retry size:S
- Description updated (diff)
- Status changed from New to Workable
Updated by mkittler about 2 months ago · Edited
- Status changed from Workable to In Progress
- Assignee set to mkittler
I guess we haven't actually decided on what we want to do here?
For now I would just ensure that rsync prints statistics so we have something to work with.
EDIT: The --stats output is not shown when using --stop-after:
$ rsync --stats --stop-after=1 /hdd/openqa-devel/openqa/share/factory/iso/openSUSE-Leap-15.5-DVD-x86_64-Build277.3-Media.iso{,.2}
stopping at requested limit
rsync error: timeout in data send/receive (code 30) at io.c(827) [generator=3.4.1]
stopping at requested limit
rsync error: timeout in data send/receive (code 30) at io.c(827) [sender=3.4.1]
$ echo $?
30
So we can't just limit rsync itself and let it print its statistics at the end.
I also haven't found a way to write just the debug output to a separate file. We probably also wanted --progress to see the speed while the download is ongoing. However, that would be way too verbose for our log files.
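One avenue that might be worth trying (untested here, just a sketch; the log file path is made up) is rsync's --log-file option, which writes rsync's messages to a separate file independently of stdout, so verbose output wouldn't end up in os-autoinst-log.txt:
$ rsync --timeout='150' --stats -av --log-file='/var/log/openqa-rsync-debug.log' SRC DST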
We could also add a retry using --partial so the transfer wouldn't have to start again from scratch. This might be better than simply extending the timeout because initially we had the problem that jobs ran into the overall job timeout (so something seems to get stuck here and maybe just retrying with a new rsync call or SSH connection helps).
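A rough shell-level sketch of what such a retry could look like (hypothetical; in practice this would rather be implemented around the run_ssh_cmd call in the svirt backend, and the retry count is arbitrary):
# retry up to 3 times; --partial keeps the partially transferred file so a retry resumes instead of starting over
for attempt in 1 2 3; do
    rsync --partial --timeout='150' --stats -av "$src" "$dst" && break
done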
Updated by openqa_review about 2 months ago
- Due date set to 2025-03-22
Setting due date based on mean cycle time of SUSE QE Tools
Updated by mkittler about 1 month ago
- Status changed from In Progress to Feedback
Updated by mkittler about 1 month ago
- Related to action #179077: [qe-core]test fails in bootloader_start on s390x. speed of downloading initrd file is slow added
Updated by livdywan about 1 month ago · Edited
- Status changed from Feedback to In Progress
Updated by livdywan about 1 month ago
- Due date changed from 2025-03-22 to 2025-03-28
- Status changed from In Progress to Feedback
I can see successful job runs and conversely no failing jobs matching this ticket for almost a week. Are we good for this ticket then?
Updated by mkittler about 1 month ago · Edited
I had to go back on the retry-approach with https://github.com/os-autoinst/os-autoinst/pull/2681.
The PR https://github.com/os-autoinst/os-autoinst/pull/2682 (draft) would possibly bring it back. However, I must say that I'm wondering whether it makes sense to go further down that rabbit hole (for this S-sized ticket). This change might or might not help, or it might cause unwanted side-effects. I can of course test it by taking a production worker slot, but as we've already seen, we'll only see the real effect of these kinds of changes in production.
The current state (we could leave this in) is that these tests simply sometimes time out with the clear error message that a specific part of the test has just taken too long. We still don't know why it sometimes takes too long and whether it is really just very slow (so retrying doesn't help anyway) or completely stuck (so retrying might actually help).
Updated by mkittler about 1 month ago
- Copied to action #179404: [sporadic] svirt s390x tests sometimes time out while syncing assets auto_review:"LIBSSH2_ERROR_TIMEOUT[\s\S]*rsync":retry added
Updated by mkittler about 1 month ago
- Status changed from Feedback to Resolved
I created #179404 as a follow-up to avoid going deeper into the rabbit hole for now.