action #176868
coordination #176337 (closed): [saga][epic] Stable os-autoinst backends with stable command execution (no mistyping)
coordination #125708: [epic] Future ideas for more stable non-qemu backends
Random SSH failures on s390x workers size:S
Description
s390x jobs on OSD have started randomly failing in bootloader_zkvm due to an SSH connection timeout:
https://openqa.suse.de/tests/16701073#step/bootloader_zkvm/19
# Test died: {
"function" => "add_disk",
"console" => "svirt",
"wantarray" => undef,
"args" => [
{
"file" => "/var/lib/openqa/share/factory/hdd/sle-15-SP4-s390x-5.14.21-150400.159.1.g7670126-Server-DVD-Incidents-Kernel-KOTD\@s390x-kvm-with-ltp.qcow2",
"backingfile" => 1,
"dev_id" => "a"
}
],
"json_cmd_token" => "ucjZOEwI",
"cmd" => "backend_proxy_console_call"
}
Time out waiting for data (-9 LIBSSH2_ERROR_TIMEOUT) at /usr/lib/perl5/vendor_perl/5.26.1/x86_64-linux-thread-multi/Net/SSH2.pm line 51.
The failures appear randomly on multiple workers:
worker32:20: https://openqa.suse.de/tests/16701075#step/bootloader_zkvm/19
worker33:13: https://openqa.suse.de/tests/16701073#step/bootloader_zkvm/19
The failures appear to be temporary: these jobs ran on worker32:20 earlier today within the span of a few minutes, in this order:
pass: https://openqa.suse.de/tests/16705573
fail: https://openqa.suse.de/tests/16705328#step/bootloader_start/20
fail: https://openqa.suse.de/tests/16707171#step/bootloader_zkvm/19
fail: https://openqa.suse.de/tests/16706965#step/bootloader_zkvm/19
pass: https://openqa.suse.de/tests/16707205
Updated by okurz about 2 months ago
- Tags changed from s390x to s390x, infra, osd
- Category set to Regressions/Crashes
- Assignee set to mgriessmeier
- Priority changed from Normal to High
- Target version set to future
Updated by okurz about 2 months ago
- Tags changed from s390x, infra, osd to s390x, infra, osd, reactive work
- Assignee deleted (mgriessmeier)
- Target version changed from future to Ready
Updated by nicksinger about 2 months ago
I had a quick look at the workers (mainly 32+33) as well as zl12+zl13, and both look fine. Given the recent changes and the sporadic character, this might not be an infra issue but rather related to https://progress.opensuse.org/issues/176076
Updated by nicksinger about 2 months ago
- Related to action #176076: [tools] tests run into MAX_JOB_TIME exceeded in bootloader_zkvm, unable to login over ssh? size:M added
Updated by mkittler about 2 months ago
- Status changed from New to In Progress
- Assignee set to mkittler
Here's the error message with a bit more context:
[2025-02-08T01:44:12.144967Z] [debug] [pid:126054] <<< testapi::type_string(string="# copying image (sle-15-SP4-s390x-5.14.21-150400.159.1.g7670126-Server-DVD-Incidents-Kernel-KOTD\@s390x-kvm-with-ltp.qcow2)...\n", max_interval=250, wait_screen_change=0, wait_still_screen=0, timeout=30, similarity_level=47)
[2025-02-08T01:44:16.677264Z] [debug] [pid:126054] tests/installation/bootloader_zkvm.pm:81 called bootloader_setup::zkvm_add_disk -> lib/bootloader_setup.pm:1260 called backend::console_proxy::__ANON__
[2025-02-08T01:44:16.677620Z] [debug] [pid:126054] <<< backend::console_proxy::__ANON__(wrapped_call={
"function" => "add_disk",
"console" => "svirt",
"args" => [
{
"file" => "/var/lib/openqa/share/factory/hdd/sle-15-SP4-s390x-5.14.21-150400.159.1.g7670126-Server-DVD-Incidents-Kernel-KOTD\@s390x-kvm-with-ltp.qcow2",
"backingfile" => 1,
"dev_id" => "a"
}
],
"wantarray" => undef
})
[2025-02-08T01:44:16.678961Z] [debug] [pid:126055] <<< backend::baseclass::run_ssh_cmd(cmd="rsync -av '/var/lib/openqa/share/factory/hdd/sle-15-SP4-s390x-5.14.21-150400.159.1.g7670126-Server-DVD-Incidents-Kernel-KOTD\@s390x-kvm-with-ltp.qcow2' '/var/lib/libvirt/images//sle-15-SP4-s390x-5.14.21-150400.159.1.g7670126-Server-DVD-Incidents-Kernel-KOTD\@s390x-kvm-with-ltp.qcow2'", wantarray=0, password="SECRET", hostname="s390zl13.oqa.prg2.suse.org", keep_open=1, username="root")
[2025-02-08T01:44:16.679293Z] [debug] [pid:126055] <<< backend::baseclass::run_ssh(cmd="rsync -av '/var/lib/openqa/share/factory/hdd/sle-15-SP4-s390x-5.14.21-150400.159.1.g7670126-Server-DVD-Incidents-Kernel-KOTD\@s390x-kvm-with-ltp.qcow2' '/var/lib/libvirt/images//sle-15-SP4-s390x-5.14.21-150400.159.1.g7670126-Server-DVD-Incidents-Kernel-KOTD\@s390x-kvm-with-ltp.qcow2'", wantarray=0, password="SECRET", hostname="s390zl13.oqa.prg2.suse.org", keep_open=1, username="root")
[2025-02-08T01:44:16.679543Z] [debug] [pid:126055] <<< backend::baseclass::new_ssh_connection(hostname="s390zl13.oqa.prg2.suse.org", keep_open=1, username="root", wantarray=0, blocking=1, password="SECRET")
[2025-02-08T01:44:16.723162Z] [debug] [pid:126055] Using existing SSH connection (key:hostname=s390zl13.oqa.prg2.suse.org,username=root,port=22)
[2025-02-08T01:49:18.079798Z] [debug] [pid:126055] [run_ssh_cmd(rsync -av '/var/lib/openqa/share/factory/hdd/sle-15-SP4-s390x-5.14.21-150400.159.1.g7670126-Server-DVD-Incidents-Kernel-KOTD@s390x-kvm-with-ltp.qcow2' '/var/lib/libvirt/images//sle-15-SP4-s390x-5.14.21-150400.159.1.g7670126-Server-DVD-Incidents-Kernel-KOTD@s390x-kvm-with-ltp.qcow2')] stdout:
sending incremental file list
sle-15-SP4-s390x-5.14.21-150400.159.1.g7670126-Server-DVD-Incidents-Kernel-KOTD@s390x-kvm-with-ltp.qcow2
[2025-02-08T01:49:18.140128Z] [info] [pid:126054] ::: basetest::runtest: # Test died: {
"function" => "add_disk",
"console" => "svirt",
"wantarray" => undef,
"args" => [
{
"file" => "/var/lib/openqa/share/factory/hdd/sle-15-SP4-s390x-5.14.21-150400.159.1.g7670126-Server-DVD-Incidents-Kernel-KOTD\@s390x-kvm-with-ltp.qcow2",
"backingfile" => 1,
"dev_id" => "a"
}
],
"json_cmd_token" => "ucjZOEwI",
"cmd" => "backend_proxy_console_call"
}
Time out waiting for data (-9 LIBSSH2_ERROR_TIMEOUT) at /usr/lib/perl5/vendor_perl/5.26.1/x86_64-linux-thread-multi/Net/SSH2.pm line 51.
Net::SSH2::die_with_error(Net::SSH2=SCALAR(0x559a0c2b5058)) called at /usr/lib/os-autoinst/backend/baseclass.pm line 1334
backend::baseclass::run_ssh_cmd(backend::svirt=HASH(0x559a0c1ab1c8), "rsync -av '/var/lib/openqa/share/factory/hdd/sle-15-SP4-s390x"..., "hostname", "s390zl13.oqa.prg2.suse.org", "username", "root", "password", "Nots3cr3t-\@3-vt") called at /usr/lib/os-autoinst/consoles/sshVirtsh.pm line 669
consoles::sshVirtsh::run_cmd(consoles::sshVirtsh=HASH(0x559a0c5bd378), "rsync -av '/var/lib/openqa/share/factory/hdd/sle-15-SP4-s390x"...) called at /usr/lib/os-autoinst/consoles/sshVirtsh.pm line 391
consoles::sshVirtsh::_copy_image_else(consoles::sshVirtsh=HASH(0x559a0c5bd378), "/var/lib/openqa/share/factory/hdd/sle-15-SP4-s390x-5.14.21-15"..., "sle-15-SP4-s390x-5.14.21-150400.159.1.g7670126-Server-DVD-Inc"..., "/var/lib/libvirt/images/") called at /usr/lib/os-autoinst/consoles/sshVirtsh.pm line 412
consoles::sshVirtsh::_copy_image_to_vm_host(consoles::sshVirtsh=HASH(0x559a0c5bd378), HASH(0x559a077e65e0), "/vmfs/volumes//openQA/", "openQA-SUT-8a.img", "openQA-SUT-8", "/var/lib/libvirt/images/", undef) called at /usr/lib/os-autoinst/consoles/sshVirtsh.pm line 475
consoles::sshVirtsh::add_disk(consoles::sshVirtsh=HASH(0x559a0c5bd378), HASH(0x559a077e65e0)) called at /usr/lib/os-autoinst/backend/baseclass.pm line 822
eval {...} called at /usr/lib/os-autoinst/backend/baseclass.pm line 817
backend::baseclass::proxy_console_call(backend::svirt=HASH(0x559a0c1ab1c8), HASH(0x559a0b1678b8)) called at /usr/lib/os-autoinst/backend/baseclass.pm line 81
backend::baseclass::handle_command(backend::svirt=HASH(0x559a0c1ab1c8), HASH(0x559a0b179c88)) called at /usr/lib/os-autoinst/backend/baseclass.pm line 618
backend::baseclass::check_socket(backend::svirt=HASH(0x559a0c1ab1c8), IO::Handle=GLOB(0x559a0b19d1a8), 0) called at /usr/lib/os-autoinst/backend/svirt.pm line 298
backend::svirt::check_socket(backend::svirt=HASH(0x559a0c1ab1c8), IO::Handle=GLOB(0x559a0b19d1a8), 0) called at /usr/lib/os-autoinst/backend/baseclass.pm line 286
backend::baseclass::do_capture(backend::svirt=HASH(0x559a0c1ab1c8), undef, 1738979037.69283) called at /usr/lib/os-autoinst/backend/baseclass.pm line 313
eval {...} called at /usr/lib/os-autoinst/backend/baseclass.pm line 313
backend::baseclass::run_capture_loop(backend::svirt=HASH(0x559a0c1ab1c8)) called at /usr/lib/os-autoinst/backend/baseclass.pm line 135
backend::baseclass::run(backend::svirt=HASH(0x559a0c1ab1c8), 13, 16) called at /usr/lib/os-autoinst/backend/driver.pm line 68
backend::driver::__ANON__(Mojo::IOLoop::ReadWriteProcess=HASH(0x559a0c40da30)) called at /usr/lib/perl5/vendor_perl/5.26.1/Mojo/IOLoop/ReadWriteProcess.pm line 329
eval {...} called at /usr/lib/perl5/vendor_perl/5.26.1/Mojo/IOLoop/ReadWriteProcess.pm line 329
Mojo::IOLoop::ReadWriteProcess::_fork(Mojo::IOLoop::ReadWriteProcess=HASH(0x559a0c40da30), CODE(0x559a0b3f8970)) called at /usr/lib/perl5/vendor_perl/5.26.1/Mojo/IOLoop/ReadWriteProcess.pm line 493
Mojo::IOLoop::ReadWriteProcess::start(Mojo::IOLoop::ReadWriteProcess=HASH(0x559a0c40da30)) called at /usr/lib/os-autoinst/backend/driver.pm line 72
backend::driver::start(backend::driver=HASH(0x559a0bf6fc78)) called at /usr/lib/os-autoinst/backend/driver.pm line 37
backend::driver::new("backend::driver", "svirt") called at /usr/lib/os-autoinst/OpenQA/Isotovideo/Backend.pm line 14
OpenQA::Isotovideo::Backend::new("OpenQA::Isotovideo::Backend") called at /usr/lib/os-autoinst/OpenQA/Isotovideo/Runner.pm line 109
OpenQA::Isotovideo::Runner::create_backend(OpenQA::Isotovideo::Runner=HASH(0x559a031dbd40)) called at /usr/lib/os-autoinst/OpenQA/Isotovideo/Runner.pm line 251
OpenQA::Isotovideo::Runner::init(OpenQA::Isotovideo::Runner=HASH(0x559a031dbd40)) called at /usr/bin/isotovideo line 183
eval {...} called at /usr/bin/isotovideo line 178
at /usr/lib/os-autoinst/backend/console_proxy.pm line 46.
backend::console_proxy::__ANON__(undef, HASH(0x559a0c24f298)) called at sle/lib/bootloader_setup.pm line 1260
bootloader_setup::zkvm_add_disk(backend::console_proxy=HASH(0x559a0bd91a38)) called at sle/tests/installation/bootloader_zkvm.pm line 81
bootloader_zkvm::run(bootloader_zkvm=HASH(0x559a0a24aa70)) called at /usr/lib/os-autoinst/basetest.pm line 345
eval {...} called at /usr/lib/os-autoinst/basetest.pm line 339
basetest::runtest(bootloader_zkvm=HASH(0x559a0a24aa70)) called at /usr/lib/os-autoinst/autotest.pm line 414
eval {...} called at /usr/lib/os-autoinst/autotest.pm line 414
autotest::runalltests() called at /usr/lib/os-autoinst/autotest.pm line 271
eval {...} called at /usr/lib/os-autoinst/autotest.pm line 271
autotest::run_all() called at /usr/lib/os-autoinst/autotest.pm line 325
autotest::__ANON__(Mojo::IOLoop::ReadWriteProcess=HASH(0x559a0c1cfb58)) called at /usr/lib/perl5/vendor_perl/5.26.1/Mojo/IOLoop/ReadWriteProcess.pm line 329
eval {...} called at /usr/lib/perl5/vendor_perl/5.26.1/Mojo/IOLoop/ReadWriteProcess.pm line 329
Mojo::IOLoop::ReadWriteProcess::_fork(Mojo::IOLoop::ReadWriteProcess=HASH(0x559a0c1cfb58), CODE(0x559a0c828878)) called at /usr/lib/perl5/vendor_perl/5.26.1/Mojo/IOLoop/ReadWriteProcess.pm line 493
Mojo::IOLoop::ReadWriteProcess::start(Mojo::IOLoop::ReadWriteProcess=HASH(0x559a0c1cfb58)) called at /usr/lib/os-autoinst/autotest.pm line 327
autotest::start_process() called at /usr/lib/os-autoinst/OpenQA/Isotovideo/Runner.pm line 103
OpenQA::Isotovideo::Runner::start_autotest(OpenQA::Isotovideo::Runner=HASH(0x559a031dbd40)) called at /usr/lib/os-autoinst/OpenQA/Isotovideo/Runner.pm line 250
OpenQA::Isotovideo::Runner::init(OpenQA::Isotovideo::Runner=HASH(0x559a031dbd40)) called at /usr/bin/isotovideo line 183
eval {...} called at /usr/bin/isotovideo line 178
I think the timestamps of these two lines are relevant:
[2025-02-08T01:44:16.723162Z] [debug] [pid:126055] Using existing SSH connection (key:hostname=s390zl13.oqa.prg2.suse.org,username=root,port=22)
[2025-02-08T01:49:18.079798Z] [debug] [pid:126055] [run_ssh_cmd(rsync -av '/var/lib/openqa/share/factory/hdd/sle-15-SP4-s390x-5.14.21-150400.159.1.g7670126-Server-DVD-Incidents-Kernel-KOTD@s390x-kvm-with-ltp.
This shows that there was in fact a 5-minute delay between starting the command and printing its output after running into a timeout. The default timeout is also exactly 5 minutes, so this actually makes perfect sense. We should probably increase this timeout here a bit, though.
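The 5-minute gap can be checked directly from the two log timestamps quoted above. A minimal sketch (timestamps copied from the log; the log uses UTC, indicated by the trailing Z):

```python
from datetime import datetime

# Timestamps taken from the two log lines quoted above
t_start = datetime.fromisoformat("2025-02-08T01:44:16.723162+00:00")
t_fail = datetime.fromisoformat("2025-02-08T01:49:18.079798+00:00")

delta = (t_fail - t_start).total_seconds()
print(f"{delta:.1f} s")  # → 301.4 s, i.e. just past the 300 s (5-minute) default
```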
Updated by mkittler about 2 months ago · Edited
- Status changed from In Progress to Feedback
This PR should fix it, assuming that 15 minutes are enough to download this asset: https://github.com/os-autoinst/os-autoinst/pull/2657
The latest job passed again:
[2025-02-10T10:01:55.269564Z] [debug] [pid:69098] Using existing SSH connection (key:hostname=s390zl13.oqa.prg2.suse.org,username=root,port=22)
[2025-02-10T10:02:51.143402Z] [debug] [pid:69098] [run_ssh_cmd(rsync -av '/var/lib/openqa/share/factory/hdd/sle-15-SP4-s390x-5.14.21-150400.159.1.g7670126-Server-DVD-Incidents-Kernel-KOTD@s390x-kvm-with-ltp.qcow2' '/var/lib/libvirt/images//sle-15-SP4-s390x-5.14.21-150400.159.1.g7670126-Server-DVD-Incidents-Kernel-KOTD@s390x-kvm-with-ltp.qcow2')] stdout:
sending incremental file list
sle-15-SP4-s390x-5.14.21-150400.159.1.g7670126-Server-DVD-Incidents-Kernel-KOTD@s390x-kvm-with-ltp.qcow2
sent 6,041,928,048 bytes received 35 bytes 106,936,780.23 bytes/sec
total size is 6,040,453,120 speedup is 1.00
[2025-02-10T10:02:51.143587Z] [debug] [pid:69098] [run_ssh_cmd(rsync -av '/var/lib/openqa/share/factory/hdd/sle-15-SP4-s390x-5.14.21-150400.159.1.g7670126-Server-DVD-Incidents-Kernel-KOTD@s390x-kvm-with-ltp.qcow2' '/var/lib/libvirt/images//sle-15-SP4-s390x-5.14.21-150400.159.1.g7670126-Server-DVD-Incidents-Kernel-KOTD@s390x-kvm-with-ltp.qcow2')] exit-code: 0
[2025-02-10T10:02:51.144050Z] [debug] [pid:69098] <<< backend::baseclass::run_ssh_cmd(cmd="qemu-img info --output=json /var/lib/openqa/share/factory/hdd/sle-15-SP4-s390x-5.14.21-150400.159.1.g7670126-Server-DVD-Incidents-Kernel-KOTD\@s390x-kvm-with-ltp.qcow2", keep_open=1, password="SECRET", hostname="s390zl13.oqa.prg2.suse.org", wantarray=1, username="root")
Here the download took only around one minute, so it can even be well within the 5-minute timeout.
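For reference, the transfer time can be estimated from rsync's own statistics in the log above (byte count and rate copied from the stats; the 5-minute figure is the old default timeout):

```python
asset_bytes = 6_041_928_048   # "sent ... bytes" from the rsync stats above
rate = 106_936_780.23         # "bytes/sec" from the rsync stats above
old_timeout_s = 5 * 60        # old default run_ssh_cmd timeout (300 s)

transfer_s = asset_bytes / rate
print(f"{transfer_s:.1f} s")  # → 56.5 s, well within the 300 s limit
# At roughly 20 MB/s the same ~6 GB asset would need ~300 s and
# hit the old 5-minute limit, which matches the observed failures.
```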
Updated by livdywan about 2 months ago
- Tags changed from s390x, infra, osd, reactive work to s390x, osd, reactive work
- Project changed from openQA Infrastructure (public) to openQA Project (public)
- Category deleted (Regressions/Crashes)
This should not be infra.
Updated by MMoese about 2 months ago · Edited
I still see the same failures, but not in a random way; instead they fail consistently, even after a couple of restarts, for example https://openqa.suse.de/tests/16727691
Updated by okurz about 2 months ago
- Status changed from Feedback to In Progress
Updated by openqa_review about 2 months ago
- Due date set to 2025-02-26
Setting due date based on mean cycle time of SUSE QE Tools
Updated by okurz about 2 months ago
Updated by livdywan about 2 months ago
- Subject changed from Random SSH failures on s390x workers to Random SSH failures on s390x workers size:S
Estimated as an S as-is
Updated by mkittler about 2 months ago
- Status changed from In Progress to Feedback
The change has been deployed. We now see other problems with s390x jobs, but I haven't seen jobs running into this timeout anymore (only jobs where the asset download worked).
Updated by mkittler about 2 months ago
- Status changed from Feedback to Resolved
I'm considering this resolved because the concrete issue hasn't occurred anymore in production. For further problems we have #177138.
Updated by MDoucha 29 days ago
- Status changed from Resolved to Workable
bootloader_zkvm still randomly fails in the add_disk action with LIBSSH2_ERROR_TIMEOUT. Example failures from today:
https://openqa.suse.de/tests/16944844#step/bootloader_zkvm/19
https://openqa.suse.de/tests/16944849#step/bootloader_zkvm/19
https://openqa.suse.de/tests/16944848#step/bootloader_zkvm/19
https://openqa.suse.de/tests/16944846#step/bootloader_zkvm/19
https://openqa.suse.de/tests/16944847#step/bootloader_zkvm/19
Updated by mkittler 29 days ago · Edited
- Status changed from Workable to Feedback
Judging by the timestamps and the specified timeout of 15 minutes, it looks like the timeout was enforced as expected when it was exceeded:
[2025-03-04T01:57:33.064193Z] [debug] [pid:34725] Using existing SSH connection (key:hostname=s390zl12.oqa.prg2.suse.org,username=root,port=22)
[2025-03-04T02:12:35.031481Z] [debug] [pid:34725] [run_ssh_cmd(rsync --timeout='150' --stats -av '/var/lib/openqa/share/factory/hdd/sle-15-SP3-s390x-5.3.18-150300.268.1.gd2bdf5f-Server-DVD-Incidents-Kernel-KOTD@s390x-kvm-with-ltp.qcow2' '/var/lib/libvirt/images//sle-15-SP3-s390x-5.3.18-150300.268.1.gd2bdf5f-Server-DVD-Incidents-Kernel-KOTD@s390x-kvm-with-ltp.qcow2')] stdout:
sending incremental file list
sle-15-SP3-s390x-5.3.18-150300.268.1.gd2bdf5f-Server-DVD-Incidents-Kernel-KOTD@s390x-kvm-with-ltp.qcow2
…
Time out waiting for data (-9 LIBSSH2_ERROR_TIMEOUT) at /usr/lib/perl5/vendor_perl/5.26.1/x86_64-linux-thread-multi/Net/SSH2.pm line 51.
Net::SSH2::die_with_error(Net::SSH2=SCALAR(0x562b7a3cd8c8)) called at /usr/lib/os-autoinst/backend/baseclass.pm line 1328
backend::baseclass::run_ssh_cmd(backend::svirt=HASH(0x562b7a84d618), "rsync --timeout='150' --stats -av '/var/lib/openqa/share/fact"..., "username", "root", "hostname", "s390zl12.oqa.prg2.suse.org", "password", "Nots3cr3t-\@3-vt", ...) called at /usr/lib/os-autoinst/consoles/sshVirtsh.pm line 674
consoles::sshVirtsh::run_cmd(consoles::sshVirtsh=HASH(0x562b79c875c8), "rsync --timeout='150' --stats -av '/var/lib/openqa/share/fact"..., "timeout", 900) called at /usr/lib/os-autoinst/consoles/sshVirtsh.pm line 396
Enforcing the timeout was actually meant to be an improvement (because we observed jobs getting stuck completely and only failing when running into the overall job timeout).
If 15 minutes are not enough, you can now use the test variable SVIRT_ASSET_DOWNLOAD_TIMEOUT_M to increase the timeout.
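As an illustration of how that variable could be set (a hypothetical invocation, not taken from the ticket; the job ID is just one of the failing jobs listed above), it can be passed as a test-variable override when retriggering a job, e.g. via openqa-clone-job:

```sh
# Hypothetical example: clone a failing job with a 30-minute asset download timeout
openqa-clone-job --within-instance https://openqa.suse.de/tests/16944844 \
    SVIRT_ASSET_DOWNLOAD_TIMEOUT_M=30
```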
Maybe there was also just a temporary infra/networking issue when these jobs were executed, as the latest jobs have passed again.
Updated by mkittler 28 days ago
- Copied to action #178324: [sporadic] svirt s390x tests sometimes time out while syncing assets auto_review:"LIBSSH2_ERROR_TIMEOUT[\s\S]*rsync":retry size:S added