Project

General

Profile

Actions

action #176868

closed

coordination #176337: [saga][epic] Stable os-autoinst backends with stable command execution (no mistyping)

coordination #125708: [epic] Future ideas for more stable non-qemu backends

Random SSH failures on s390x workers size:S

Added by MDoucha about 2 months ago. Updated 28 days ago.

Status:
Resolved
Priority:
High
Assignee:
Category:
Regressions/Crashes
Target version:
Start date:
2025-02-10
Due date:
% Done:

0%

Estimated time:

Description

s390x jobs on OSD have started randomly failing in bootloader_zkvm due to SSH connection timeout:
https://openqa.suse.de/tests/16701073#step/bootloader_zkvm/19

# Test died: {
  "function" => "add_disk",
  "console" => "svirt",
  "wantarray" => undef,
  "args" => [
              {
                "file" => "/var/lib/openqa/share/factory/hdd/sle-15-SP4-s390x-5.14.21-150400.159.1.g7670126-Server-DVD-Incidents-Kernel-KOTD\@s390x-kvm-with-ltp.qcow2",
                "backingfile" => 1,
                "dev_id" => "a"
              }
            ],
  "json_cmd_token" => "ucjZOEwI",
  "cmd" => "backend_proxy_console_call"
}
Time out waiting for data (-9 LIBSSH2_ERROR_TIMEOUT) at /usr/lib/perl5/vendor_perl/5.26.1/x86_64-linux-thread-multi/Net/SSH2.pm line 51.

The failures appear randomly on multiple workers:
worker32:20: https://openqa.suse.de/tests/16701075#step/bootloader_zkvm/19
worker33:13: https://openqa.suse.de/tests/16701073#step/bootloader_zkvm/19

The failures appear to be temporary, these jobs ran on worker32:20 earlier today in the span of a few minutes, in this order:
pass: https://openqa.suse.de/tests/16705573
fail: https://openqa.suse.de/tests/16705328#step/bootloader_start/20
fail: https://openqa.suse.de/tests/16707171#step/bootloader_zkvm/19
fail: https://openqa.suse.de/tests/16706965#step/bootloader_zkvm/19
pass: https://openqa.suse.de/tests/16707205


Related issues 2 (0 open2 closed)

Related to openQA Tests (public) - action #176076: [tools] tests run into MAX_JOB_TIME exceeded in bootloader_zkvm, unable to login over ssh? size:MResolvedmkittler2025-01-23

Actions
Copied to openQA Project (public) - action #178324: [sporadic] svirt s390x tests sometimes time out while syncing assets auto_review:"LIBSSH2_ERROR_TIMEOUT[\s\S]*rsync":retry size:SResolvedmkittler2025-02-10

Actions
Actions #1

Updated by okurz about 2 months ago

  • Tags changed from s390x to s390x, infra, osd
  • Category set to Regressions/Crashes
  • Assignee set to mgriessmeier
  • Priority changed from Normal to High
  • Target version set to future
Actions #2

Updated by okurz about 2 months ago

  • Tags changed from s390x, infra, osd to s390x, infra, osd, reactive work
  • Assignee deleted (mgriessmeier)
  • Target version changed from future to Ready
Actions #3

Updated by nicksinger about 2 months ago

I had a quick look at the workers (32+33 mainly) and zl12+zl13 and both look fine. Given the recent changes and the sporadic character this might not be an infra issue but related to https://progress.opensuse.org/issues/176076

Actions #4

Updated by nicksinger about 2 months ago

  • Related to action #176076: [tools] tests run into MAX_JOB_TIME exceeded in bootloader_zkvm, unable to login over ssh? size:M added
Actions #5

Updated by mkittler about 2 months ago

  • Status changed from New to In Progress
  • Assignee set to mkittler

Here's the error message with a bit more context:

[2025-02-08T01:44:12.144967Z] [debug] [pid:126054] <<< testapi::type_string(string="# copying image (sle-15-SP4-s390x-5.14.21-150400.159.1.g7670126-Server-DVD-Incidents-Kernel-KOTD\@s390x-kvm-with-ltp.qcow2)...\n", max_interval=250, wait_screen_change=0, wait_still_screen=0, timeout=30, similarity_level=47)
[2025-02-08T01:44:16.677264Z] [debug] [pid:126054] tests/installation/bootloader_zkvm.pm:81 called bootloader_setup::zkvm_add_disk -> lib/bootloader_setup.pm:1260 called backend::console_proxy::__ANON__
[2025-02-08T01:44:16.677620Z] [debug] [pid:126054] <<< backend::console_proxy::__ANON__(wrapped_call={
    "function" => "add_disk",
    "console" => "svirt",
    "args" => [
                {
                  "file" => "/var/lib/openqa/share/factory/hdd/sle-15-SP4-s390x-5.14.21-150400.159.1.g7670126-Server-DVD-Incidents-Kernel-KOTD\@s390x-kvm-with-ltp.qcow2",
                  "backingfile" => 1,
                  "dev_id" => "a"
                }
              ],
    "wantarray" => undef
  })
[2025-02-08T01:44:16.678961Z] [debug] [pid:126055] <<< backend::baseclass::run_ssh_cmd(cmd="rsync -av '/var/lib/openqa/share/factory/hdd/sle-15-SP4-s390x-5.14.21-150400.159.1.g7670126-Server-DVD-Incidents-Kernel-KOTD\@s390x-kvm-with-ltp.qcow2' '/var/lib/libvirt/images//sle-15-SP4-s390x-5.14.21-150400.159.1.g7670126-Server-DVD-Incidents-Kernel-KOTD\@s390x-kvm-with-ltp.qcow2'", wantarray=0, password="SECRET", hostname="s390zl13.oqa.prg2.suse.org", keep_open=1, username="root")
[2025-02-08T01:44:16.679293Z] [debug] [pid:126055] <<< backend::baseclass::run_ssh(cmd="rsync -av '/var/lib/openqa/share/factory/hdd/sle-15-SP4-s390x-5.14.21-150400.159.1.g7670126-Server-DVD-Incidents-Kernel-KOTD\@s390x-kvm-with-ltp.qcow2' '/var/lib/libvirt/images//sle-15-SP4-s390x-5.14.21-150400.159.1.g7670126-Server-DVD-Incidents-Kernel-KOTD\@s390x-kvm-with-ltp.qcow2'", wantarray=0, password="SECRET", hostname="s390zl13.oqa.prg2.suse.org", keep_open=1, username="root")
[2025-02-08T01:44:16.679543Z] [debug] [pid:126055] <<< backend::baseclass::new_ssh_connection(hostname="s390zl13.oqa.prg2.suse.org", keep_open=1, username="root", wantarray=0, blocking=1, password="SECRET")
[2025-02-08T01:44:16.723162Z] [debug] [pid:126055] Using existing SSH connection (key:hostname=s390zl13.oqa.prg2.suse.org,username=root,port=22)
[2025-02-08T01:49:18.079798Z] [debug] [pid:126055] [run_ssh_cmd(rsync -av '/var/lib/openqa/share/factory/hdd/sle-15-SP4-s390x-5.14.21-150400.159.1.g7670126-Server-DVD-Incidents-Kernel-KOTD@s390x-kvm-with-ltp.qcow2' '/var/lib/libvirt/images//sle-15-SP4-s390x-5.14.21-150400.159.1.g7670126-Server-DVD-Incidents-Kernel-KOTD@s390x-kvm-with-ltp.qcow2')] stdout:
  sending incremental file list
  sle-15-SP4-s390x-5.14.21-150400.159.1.g7670126-Server-DVD-Incidents-Kernel-KOTD@s390x-kvm-with-ltp.qcow2

[2025-02-08T01:49:18.140128Z] [info] [pid:126054] ::: basetest::runtest: # Test died: {
    "function" => "add_disk",
    "console" => "svirt",
    "wantarray" => undef,
    "args" => [
                {
                  "file" => "/var/lib/openqa/share/factory/hdd/sle-15-SP4-s390x-5.14.21-150400.159.1.g7670126-Server-DVD-Incidents-Kernel-KOTD\@s390x-kvm-with-ltp.qcow2",
                  "backingfile" => 1,
                  "dev_id" => "a"
                }
              ],
    "json_cmd_token" => "ucjZOEwI",
    "cmd" => "backend_proxy_console_call"
  }
  Time out waiting for data (-9 LIBSSH2_ERROR_TIMEOUT) at /usr/lib/perl5/vendor_perl/5.26.1/x86_64-linux-thread-multi/Net/SSH2.pm line 51.
    Net::SSH2::die_with_error(Net::SSH2=SCALAR(0x559a0c2b5058)) called at /usr/lib/os-autoinst/backend/baseclass.pm line 1334
    backend::baseclass::run_ssh_cmd(backend::svirt=HASH(0x559a0c1ab1c8), "rsync -av '/var/lib/openqa/share/factory/hdd/sle-15-SP4-s390x"..., "hostname", "s390zl13.oqa.prg2.suse.org", "username", "root", "password", "Nots3cr3t-\@3-vt") called at /usr/lib/os-autoinst/consoles/sshVirtsh.pm line 669
    consoles::sshVirtsh::run_cmd(consoles::sshVirtsh=HASH(0x559a0c5bd378), "rsync -av '/var/lib/openqa/share/factory/hdd/sle-15-SP4-s390x"...) called at /usr/lib/os-autoinst/consoles/sshVirtsh.pm line 391
    consoles::sshVirtsh::_copy_image_else(consoles::sshVirtsh=HASH(0x559a0c5bd378), "/var/lib/openqa/share/factory/hdd/sle-15-SP4-s390x-5.14.21-15"..., "sle-15-SP4-s390x-5.14.21-150400.159.1.g7670126-Server-DVD-Inc"..., "/var/lib/libvirt/images/") called at /usr/lib/os-autoinst/consoles/sshVirtsh.pm line 412
    consoles::sshVirtsh::_copy_image_to_vm_host(consoles::sshVirtsh=HASH(0x559a0c5bd378), HASH(0x559a077e65e0), "/vmfs/volumes//openQA/", "openQA-SUT-8a.img", "openQA-SUT-8", "/var/lib/libvirt/images/", undef) called at /usr/lib/os-autoinst/consoles/sshVirtsh.pm line 475
    consoles::sshVirtsh::add_disk(consoles::sshVirtsh=HASH(0x559a0c5bd378), HASH(0x559a077e65e0)) called at /usr/lib/os-autoinst/backend/baseclass.pm line 822
    eval {...} called at /usr/lib/os-autoinst/backend/baseclass.pm line 817
    backend::baseclass::proxy_console_call(backend::svirt=HASH(0x559a0c1ab1c8), HASH(0x559a0b1678b8)) called at /usr/lib/os-autoinst/backend/baseclass.pm line 81
    backend::baseclass::handle_command(backend::svirt=HASH(0x559a0c1ab1c8), HASH(0x559a0b179c88)) called at /usr/lib/os-autoinst/backend/baseclass.pm line 618
    backend::baseclass::check_socket(backend::svirt=HASH(0x559a0c1ab1c8), IO::Handle=GLOB(0x559a0b19d1a8), 0) called at /usr/lib/os-autoinst/backend/svirt.pm line 298
    backend::svirt::check_socket(backend::svirt=HASH(0x559a0c1ab1c8), IO::Handle=GLOB(0x559a0b19d1a8), 0) called at /usr/lib/os-autoinst/backend/baseclass.pm line 286
    backend::baseclass::do_capture(backend::svirt=HASH(0x559a0c1ab1c8), undef, 1738979037.69283) called at /usr/lib/os-autoinst/backend/baseclass.pm line 313
    eval {...} called at /usr/lib/os-autoinst/backend/baseclass.pm line 313
    backend::baseclass::run_capture_loop(backend::svirt=HASH(0x559a0c1ab1c8)) called at /usr/lib/os-autoinst/backend/baseclass.pm line 135
    backend::baseclass::run(backend::svirt=HASH(0x559a0c1ab1c8), 13, 16) called at /usr/lib/os-autoinst/backend/driver.pm line 68
    backend::driver::__ANON__(Mojo::IOLoop::ReadWriteProcess=HASH(0x559a0c40da30)) called at /usr/lib/perl5/vendor_perl/5.26.1/Mojo/IOLoop/ReadWriteProcess.pm line 329
    eval {...} called at /usr/lib/perl5/vendor_perl/5.26.1/Mojo/IOLoop/ReadWriteProcess.pm line 329
    Mojo::IOLoop::ReadWriteProcess::_fork(Mojo::IOLoop::ReadWriteProcess=HASH(0x559a0c40da30), CODE(0x559a0b3f8970)) called at /usr/lib/perl5/vendor_perl/5.26.1/Mojo/IOLoop/ReadWriteProcess.pm line 493
    Mojo::IOLoop::ReadWriteProcess::start(Mojo::IOLoop::ReadWriteProcess=HASH(0x559a0c40da30)) called at /usr/lib/os-autoinst/backend/driver.pm line 72
    backend::driver::start(backend::driver=HASH(0x559a0bf6fc78)) called at /usr/lib/os-autoinst/backend/driver.pm line 37
    backend::driver::new("backend::driver", "svirt") called at /usr/lib/os-autoinst/OpenQA/Isotovideo/Backend.pm line 14
    OpenQA::Isotovideo::Backend::new("OpenQA::Isotovideo::Backend") called at /usr/lib/os-autoinst/OpenQA/Isotovideo/Runner.pm line 109
    OpenQA::Isotovideo::Runner::create_backend(OpenQA::Isotovideo::Runner=HASH(0x559a031dbd40)) called at /usr/lib/os-autoinst/OpenQA/Isotovideo/Runner.pm line 251
    OpenQA::Isotovideo::Runner::init(OpenQA::Isotovideo::Runner=HASH(0x559a031dbd40)) called at /usr/bin/isotovideo line 183
    eval {...} called at /usr/bin/isotovideo line 178
   at /usr/lib/os-autoinst/backend/console_proxy.pm line 46.
    backend::console_proxy::__ANON__(undef, HASH(0x559a0c24f298)) called at sle/lib/bootloader_setup.pm line 1260
    bootloader_setup::zkvm_add_disk(backend::console_proxy=HASH(0x559a0bd91a38)) called at sle/tests/installation/bootloader_zkvm.pm line 81
    bootloader_zkvm::run(bootloader_zkvm=HASH(0x559a0a24aa70)) called at /usr/lib/os-autoinst/basetest.pm line 345
    eval {...} called at /usr/lib/os-autoinst/basetest.pm line 339
    basetest::runtest(bootloader_zkvm=HASH(0x559a0a24aa70)) called at /usr/lib/os-autoinst/autotest.pm line 414
    eval {...} called at /usr/lib/os-autoinst/autotest.pm line 414
    autotest::runalltests() called at /usr/lib/os-autoinst/autotest.pm line 271
    eval {...} called at /usr/lib/os-autoinst/autotest.pm line 271
    autotest::run_all() called at /usr/lib/os-autoinst/autotest.pm line 325
    autotest::__ANON__(Mojo::IOLoop::ReadWriteProcess=HASH(0x559a0c1cfb58)) called at /usr/lib/perl5/vendor_perl/5.26.1/Mojo/IOLoop/ReadWriteProcess.pm line 329
    eval {...} called at /usr/lib/perl5/vendor_perl/5.26.1/Mojo/IOLoop/ReadWriteProcess.pm line 329
    Mojo::IOLoop::ReadWriteProcess::_fork(Mojo::IOLoop::ReadWriteProcess=HASH(0x559a0c1cfb58), CODE(0x559a0c828878)) called at /usr/lib/perl5/vendor_perl/5.26.1/Mojo/IOLoop/ReadWriteProcess.pm line 493
    Mojo::IOLoop::ReadWriteProcess::start(Mojo::IOLoop::ReadWriteProcess=HASH(0x559a0c1cfb58)) called at /usr/lib/os-autoinst/autotest.pm line 327
    autotest::start_process() called at /usr/lib/os-autoinst/OpenQA/Isotovideo/Runner.pm line 103
    OpenQA::Isotovideo::Runner::start_autotest(OpenQA::Isotovideo::Runner=HASH(0x559a031dbd40)) called at /usr/lib/os-autoinst/OpenQA/Isotovideo/Runner.pm line 250
    OpenQA::Isotovideo::Runner::init(OpenQA::Isotovideo::Runner=HASH(0x559a031dbd40)) called at /usr/bin/isotovideo line 183
    eval {...} called at /usr/bin/isotovideo line 178

I think the timestamps between those lines are relevant:

[2025-02-08T01:44:16.723162Z] [debug] [pid:126055] Using existing SSH connection (key:hostname=s390zl13.oqa.prg2.suse.org,username=root,port=22)
[2025-02-08T01:49:18.079798Z] [debug] [pid:126055] [run_ssh_cmd(rsync -av '/var/lib/openqa/share/factory/hdd/sle-15-SP4-s390x-5.14.21-150400.159.1.g7670126-Server-DVD-Incidents-Kernel-KOTD@s390x-kvm-with-ltp.

This shows that there was in fact a 5 minutes delay between starting the command and printing its output after running into a timeout. The default timeout is also exactly 5 minutes so this makes actually perfectly sense. We probably should increase this timeout here a little bit, though.

Actions #6

Updated by mkittler about 2 months ago · Edited

  • Status changed from In Progress to Feedback

This PR should fix it assuming that 15 minutes are enough to download this asset: https://github.com/os-autoinst/os-autoinst/pull/2657

The latest job passed again:

[2025-02-10T10:01:55.269564Z] [debug] [pid:69098] Using existing SSH connection (key:hostname=s390zl13.oqa.prg2.suse.org,username=root,port=22)
[2025-02-10T10:02:51.143402Z] [debug] [pid:69098] [run_ssh_cmd(rsync -av '/var/lib/openqa/share/factory/hdd/sle-15-SP4-s390x-5.14.21-150400.159.1.g7670126-Server-DVD-Incidents-Kernel-KOTD@s390x-kvm-with-ltp.qcow2' '/var/lib/libvirt/images//sle-15-SP4-s390x-5.14.21-150400.159.1.g7670126-Server-DVD-Incidents-Kernel-KOTD@s390x-kvm-with-ltp.qcow2')] stdout:
  sending incremental file list
  sle-15-SP4-s390x-5.14.21-150400.159.1.g7670126-Server-DVD-Incidents-Kernel-KOTD@s390x-kvm-with-ltp.qcow2

  sent 6,041,928,048 bytes  received 35 bytes  106,936,780.23 bytes/sec
  total size is 6,040,453,120  speedup is 1.00

[2025-02-10T10:02:51.143587Z] [debug] [pid:69098] [run_ssh_cmd(rsync -av '/var/lib/openqa/share/factory/hdd/sle-15-SP4-s390x-5.14.21-150400.159.1.g7670126-Server-DVD-Incidents-Kernel-KOTD@s390x-kvm-with-ltp.qcow2' '/var/lib/libvirt/images//sle-15-SP4-s390x-5.14.21-150400.159.1.g7670126-Server-DVD-Incidents-Kernel-KOTD@s390x-kvm-with-ltp.qcow2')] exit-code: 0
[2025-02-10T10:02:51.144050Z] [debug] [pid:69098] <<< backend::baseclass::run_ssh_cmd(cmd="qemu-img info --output=json /var/lib/openqa/share/factory/hdd/sle-15-SP4-s390x-5.14.21-150400.159.1.g7670126-Server-DVD-Incidents-Kernel-KOTD\@s390x-kvm-with-ltp.qcow2", keep_open=1, password="SECRET", hostname="s390zl13.oqa.prg2.suse.org", wantarray=1, username="root")

Here the download took only around one minute. So it can even be well within the 5 minutes timeout.

Actions #7

Updated by okurz about 2 months ago

  • Parent task set to #125708
Actions #8

Updated by livdywan about 2 months ago

  • Tags changed from s390x, infra, osd, reactive work to s390x, osd, reactive work
  • Project changed from openQA Infrastructure (public) to openQA Project (public)
  • Category deleted (Regressions/Crashes )

This should not be infra.

Actions #9

Updated by okurz about 2 months ago

  • Category set to Regressions/Crashes
Actions #10

Updated by MMoese about 2 months ago · Edited

I still see the same failures but not in a random way, instead they fail consistently, even after a couple of restarts, like for example https://openqa.suse.de/tests/16727691 .

Actions #11

Updated by okurz about 2 months ago

  • Status changed from Feedback to In Progress
Actions #12

Updated by openqa_review about 2 months ago

  • Due date set to 2025-02-26

Setting due date based on mean cycle time of SUSE QE Tools

Actions #14

Updated by livdywan about 2 months ago

  • Subject changed from Random SSH failures on s390x workers to Random SSH failures on s390x workers size:S

Estimated as an S as-is

Actions #15

Updated by mkittler about 2 months ago

  • Status changed from In Progress to Feedback

The change has been deployed. We now see other problems with s390x jobs but I haven't seen jobs running into a timeout anymore (only ones where the asset download worked).

Actions #16

Updated by mkittler about 2 months ago

  • Status changed from Feedback to Resolved

I'm considering this resolved because the concrete issue hasn't occurred anymore in production. For further problems we have #177138.

Actions #18

Updated by livdywan 29 days ago

  • Due date deleted (2025-02-26)

Resetting the due date as this was resolved before.

Actions #19

Updated by mkittler 29 days ago · Edited

  • Status changed from Workable to Feedback

Judging by the timestamps and the specified timeout of 15 minutes it looks like the timeout was enforced as expected when it was exceeded:

[2025-03-04T01:57:33.064193Z] [debug] [pid:34725] Using existing SSH connection (key:hostname=s390zl12.oqa.prg2.suse.org,username=root,port=22)
[2025-03-04T02:12:35.031481Z] [debug] [pid:34725] [run_ssh_cmd(rsync --timeout='150' --stats -av '/var/lib/openqa/share/factory/hdd/sle-15-SP3-s390x-5.3.18-150300.268.1.gd2bdf5f-Server-DVD-Incidents-Kernel-KOTD@s390x-kvm-with-ltp.qcow2' '/var/lib/libvirt/images//sle-15-SP3-s390x-5.3.18-150300.268.1.gd2bdf5f-Server-DVD-Incidents-Kernel-KOTD@s390x-kvm-with-ltp.qcow2')] stdout:
  sending incremental file list
  sle-15-SP3-s390x-5.3.18-150300.268.1.gd2bdf5f-Server-DVD-Incidents-Kernel-KOTD@s390x-kvm-with-ltp.qcow2
…
  Time out waiting for data (-9 LIBSSH2_ERROR_TIMEOUT) at /usr/lib/perl5/vendor_perl/5.26.1/x86_64-linux-thread-multi/Net/SSH2.pm line 51.
    Net::SSH2::die_with_error(Net::SSH2=SCALAR(0x562b7a3cd8c8)) called at /usr/lib/os-autoinst/backend/baseclass.pm line 1328
    backend::baseclass::run_ssh_cmd(backend::svirt=HASH(0x562b7a84d618), "rsync --timeout='150' --stats -av '/var/lib/openqa/share/fact"..., "username", "root", "hostname", "s390zl12.oqa.prg2.suse.org", "password", "Nots3cr3t-\@3-vt", ...) called at /usr/lib/os-autoinst/consoles/sshVirtsh.pm line 674
    consoles::sshVirtsh::run_cmd(consoles::sshVirtsh=HASH(0x562b79c875c8), "rsync --timeout='150' --stats -av '/var/lib/openqa/share/fact"..., "timeout", 900) called at /usr/lib/os-autoinst/consoles/sshVirtsh.pm line 396

Enforcing the timeout was actually meant to be an improvement (because we observed jobs getting stuck completely and only failing when running into the overall job timeout).

If 15 minutes are not enough you can now use the test variable SVIRT_ASSET_DOWNLOAD_TIMEOUT_M to increase the timeout.

Maybe there was also just a temporary infra/networking issue when these jobs were executed as the latest jobs have passed again so it.

Actions #20

Updated by mkittler 28 days ago

  • Copied to action #178324: [sporadic] svirt s390x tests sometimes time out while syncing assets auto_review:"LIBSSH2_ERROR_TIMEOUT[\s\S]*rsync":retry size:S added
Actions #21

Updated by mkittler 28 days ago

  • Status changed from Feedback to Resolved

I created a follow-up (#178324).

Actions

Also available in: Atom PDF