action #176076
closed
openQA Project (public) - coordination #176337: [saga][epic] Stable os-autoinst backends with stable command execution (no mistyping)
openQA Project (public) - coordination #125708: [epic] Future ideas for more stable non-qemu backends
[tools] tests run into MAX_JOB_TIME exceeded in bootloader_zkvm, unable to login over ssh? size:M
0%
Description
Observation
openQA test in scenario sle-15-SP6-BCI-Updates-s390x-golang_stable_on_SLES_15-SP4_podman@s390x-kvm run into MAX_JOB_TIME exceeded in
bootloader_zkvm
This was brought up in Slack #eng-testing by @MDoucha in https://suse.slack.com/archives/C02CANHLANP/p1737624344362149
Reproducible
Seems to fail consistently looking at the worker history, e.g. https://openqa.suse.de/admin/workers/2660
Expected result
Last good: https://openqa.suse.de/tests/16506528 from 2025-01-20 09:14:53Z
Acceptance criteria
- AC1: Clear error output well before MAX_JOB_TIME about the actual cause
- AC2: Consistently stable tests again on s390x-kvm within the OSD infrastructure
Suggestions
- Try to reproduce manually or using developer mode
- Trigger investigation jobs with openqa-investigate to bisect between impact of product, tests, infrastructure
- Gather statistics, e.g. using openqa-clone-job --repeat
- Ensure that the error output is clear and helps to distinguish between test issues and infrastructure issues
- Fix the actual underlying problem
Further details
Always latest result in this scenario: latest
While this is a test on OSD, so far it looks like a generic openQA s390x problem, so not assigning with tag "infra"
Updated by mkittler about 2 months ago
Looks like this happens across different worker slots and hosts:
openqa=> select count(jobs.id) as timeouted_job_count, min(t_finished) as first_t_finished, max(t_finished) as last_t_finished, string_agg(DISTINCT concat(jobs.ARCH, '@', jobs.MACHINE), ', ') as job_type, string_agg(DISTINCT concat(workers.host, ':', workers.instance), ', ') as worker_name, jobs.assigned_worker_id as worker_id from jobs join workers on workers.id = jobs.assigned_worker_id where t_finished >= '2025-01-22T00:00:00' and result = 'timeout_exceeded' group by jobs.assigned_worker_id order by count(jobs.id) desc limit 50;
timeouted_job_count | first_t_finished | last_t_finished | job_type | worker_name | worker_id
---------------------+---------------------+---------------------+-----------------------------+----------------+-----------
17 | 2025-01-22 01:54:32 | 2025-01-23 16:47:03 | s390x@s390x-kvm | worker33:14 | 2660
17 | 2025-01-22 00:30:57 | 2025-01-23 16:11:06 | s390x@s390x-kvm | worker32:16 | 2622
17 | 2025-01-22 00:48:40 | 2025-01-23 15:57:33 | s390x@s390x-kvm | worker32:20 | 2626
17 | 2025-01-22 01:33:55 | 2025-01-23 16:07:49 | s390x@s390x-kvm | worker32:22 | 2628
17 | 2025-01-22 01:18:54 | 2025-01-23 16:01:46 | s390x@s390x-kvm | worker32:23 | 2629
17 | 2025-01-22 00:54:17 | 2025-01-23 15:28:39 | s390x@s390x-kvm | worker33:6 | 2652
17 | 2025-01-22 00:54:58 | 2025-01-23 16:01:42 | s390x@s390x-kvm | worker33:10 | 2656
16 | 2025-01-22 00:40:25 | 2025-01-23 14:11:44 | s390x@s390x-kvm | worker32:19 | 2625
16 | 2025-01-22 00:08:51 | 2025-01-23 15:24:44 | s390x@s390x-kvm | worker32:18 | 2624
15 | 2025-01-22 01:03:08 | 2025-01-23 14:17:59 | s390x@s390x-kvm | worker33:9 | 2655
15 | 2025-01-22 00:21:12 | 2025-01-23 15:27:35 | s390x@s390x-kvm | worker33:8 | 2654
14 | 2025-01-22 00:28:29 | 2025-01-23 16:55:07 | s390x@s390x-kvm | worker32:24 | 2630
14 | 2025-01-22 03:02:01 | 2025-01-23 14:43:14 | s390x@s390x-kvm | worker32:21 | 2627
14 | 2025-01-22 06:45:07 | 2025-01-23 16:09:15 | s390x@s390x-kvm | worker33:12 | 2658
13 | 2025-01-22 01:50:01 | 2025-01-23 16:48:13 | s390x@s390x-kvm | worker33:13 | 2659
13 | 2025-01-22 03:02:02 | 2025-01-23 14:04:41 | s390x@s390x-kvm | worker32:17 | 2623
12 | 2025-01-22 05:07:03 | 2025-01-23 07:20:31 | s390x@s390x-kvm | worker32:25 | 2631
11 | 2025-01-22 03:56:07 | 2025-01-23 12:20:01 | s390x@s390x-kvm | worker33:7 | 2653
10 | 2025-01-22 03:51:46 | 2025-01-23 07:22:01 | s390x@s390x-kvm | worker33:15 | 2661
10 | 2025-01-22 03:02:08 | 2025-01-23 14:28:03 | s390x@s390x-kvm | worker33:11 | 2657
3 | 2025-01-22 02:32:59 | 2025-01-22 15:45:22 | s390x@s390x-kvm | worker32:8 | 3866
2 | 2025-01-23 04:31:55 | 2025-01-23 10:04:17 | s390x@s390x-kvm | worker32:9 | 3868
1 | 2025-01-23 12:05:37 | 2025-01-23 12:05:37 | x86_64@64bit-4gbram | worker35:19 | 2745
1 | 2025-01-23 09:43:28 | 2025-01-23 09:43:28 | x86_64@uefi | worker36:37 | 2803
1 | 2025-01-22 01:43:01 | 2025-01-22 01:43:01 | aarch64@aarch64-virtio | worker-arm1:21 | 3027
1 | 2025-01-22 02:25:27 | 2025-01-22 02:25:27 | aarch64@aarch64-virtio | worker-arm2:1 | 3047
1 | 2025-01-22 02:46:59 | 2025-01-22 02:46:59 | aarch64@aarch64-virtio | worker-arm2:4 | 3050
1 | 2025-01-22 02:14:32 | 2025-01-22 02:14:32 | aarch64@aarch64-virtio | worker-arm2:8 | 3054
1 | 2025-01-22 02:14:57 | 2025-01-22 02:14:57 | aarch64@aarch64-virtio | worker-arm2:22 | 3068
1 | 2025-01-23 15:10:56 | 2025-01-23 15:10:56 | x86_64@64bit | worker40:9 | 3095
1 | 2025-01-23 16:10:55 | 2025-01-23 16:10:55 | x86_64@uefi | worker33:43 | 3334
1 | 2025-01-23 12:37:28 | 2025-01-23 12:37:28 | x86_64@uefi | worker33:50 | 3433
1 | 2025-01-22 01:14:22 | 2025-01-22 01:14:22 | x86_64@64bit | worker31:55 | 3561
1 | 2025-01-22 00:32:39 | 2025-01-22 00:32:39 | x86_64@64bit | worker31:56 | 3562
1 | 2025-01-23 08:47:33 | 2025-01-23 08:47:33 | s390x@s390x-kvm | worker32:10 | 3869
1 | 2025-01-22 12:13:04 | 2025-01-22 12:13:04 | x86_64@uefi-staging | worker32:60 | 3885
1 | 2025-01-22 00:22:32 | 2025-01-22 00:22:32 | x86_64@64bit | worker34:57 | 3897
1 | 2025-01-22 13:46:28 | 2025-01-22 13:46:28 | x86_64@64bit | worker34:63 | 3921
1 | 2025-01-22 19:13:23 | 2025-01-22 19:13:23 | x86_64@64bit-4gbram | worker35:59 | 3946
1 | 2025-01-22 03:58:33 | 2025-01-22 03:58:33 | aarch64@ipmi-64bit-unarmed | sapworker1:41 | 3957
1 | 2025-01-22 07:26:02 | 2025-01-22 07:26:02 | s390x@s390x-kvm | worker32:11 | 2617
1 | 2025-01-22 15:48:47 | 2025-01-22 15:48:47 | x86_64@64bit-ipmi-large-mem | worker36:63 | 3993
1 | 2025-01-22 23:09:56 | 2025-01-22 23:09:56 | s390x@s390x-kvm | worker31:9 | 2575
(43 rows)
Updated by nicksinger about 2 months ago
- Status changed from New to In Progress
- Assignee set to nicksinger
Started ~3 days ago (2025-01-20 11:00 UTC).
Worker at time of "hanging" executes:
CGroup: /openqa.slice/openqa-worker.slice/openqa-worker-auto-restart@14.service
├─112489 perl /usr/bin/isotovideo -d
├─112502 "/usr/bin/isotovideo: comman"
├─112515 "/usr/bin/isotovideo: autote"
├─112516 "/usr/bin/isotovideo: backen"
├─112897 /usr/lib/os-autoinst/videoencoder /var/lib/openqa/pool/14/video.ogv -n -x 1024 -y 768
├─112962 Xvnc -depth 16 -inetd -SecurityTypes None -ac :54025
├─112967 icewm -c /usr/lib/os-autoinst/consoles/icewm.cfg
├─112968 xterm -fn eurlatgr -fullscreen +sb -b 0 -bc -uc -bcf 200 -bcn 200 +bdc -xrm "xterm*boldMode: false" -xrm "xterm*foreground: rgb:aa/aa/aa" -xrm "xterm*background: rgb:00/00/00" -xrm "xterm*color0: rgb:00/00/00" -xrm "xterm*color1: rgb:aa/00/00" -xrm "xterm*color2: rgb:00/aa/00" -xrm "xterm*color3:>
├─112969 bash -c "TERM=xterm ssh -o TCPKeepAlive=yes -o ServerAliveCountMax=480 -o ServerAliveInterval=60 -o UserKnownHostsFile=/dev/null -o StrictHostKeyChecking=no -o PubkeyAuthentication=no root@s390zl13.oqa.prg2.suse.org; read"
├─112970 ssh -o TCPKeepAlive=yes -o ServerAliveCountMax=480 -o ServerAliveInterval=60 -o UserKnownHostsFile=/dev/null -o StrictHostKeyChecking=no -o PubkeyAuthentication=no root@s390zl13.oqa.prg2.suse.org
└─125570 /usr/bin/perl /usr/share/openqa/script/worker --instance 14
Log on worker33 contains a lot of:
Jan 23 16:47:03 worker33 worker[125570]: [debug] [pid:125570] REST-API call: POST "http://openqa.suse.de/api/v1/jobs/16554787/status"
Jan 23 16:47:03 worker33 worker[125570]: OpenQA::Worker::exec(OpenQA::Worker=HASH(0x55f8be2eb290)) called at /usr/share/openqa/script/worker line 120
Jan 23 16:47:03 worker33 worker[125570]: Mojo::IOLoop::start(Mojo::IOLoop=HASH(0x55f8bc83cd08)) called at /usr/share/openqa/script/../lib/OpenQA/Worker.pm line 369
Jan 23 16:47:03 worker33 worker[125570]: Mojo::Reactor::Poll::start(Mojo::Reactor::Poll=HASH(0x55f8baf06660)) called at /usr/lib/perl5/vendor_perl/5.26.1/Mojo/IOLoop.pm line 134
Jan 23 16:47:03 worker33 worker[125570]: Mojo::Reactor::Poll::one_tick(Mojo::Reactor::Poll=HASH(0x55f8baf06660)) called at /usr/lib/perl5/vendor_perl/5.26.1/Mojo/Reactor/Poll.pm line 101
Jan 23 16:47:03 worker33 worker[125570]: Mojo::Reactor::Poll::_try(Mojo::Reactor::Poll=HASH(0x55f8baf06660), "I/O watcher", CODE(0x55f8c260f7a8), 0) called at /usr/lib/perl5/vendor_perl/5.26.1/Mojo/Reactor/Poll.pm line 60
Jan 23 16:47:03 worker33 worker[125570]: eval {...} called at /usr/lib/perl5/vendor_perl/5.26.1/Mojo/Reactor/Poll.pm line 141
Jan 23 16:47:03 worker33 worker[125570]: Mojo::IOLoop::Stream::__ANON__(Mojo::Reactor::Poll=HASH(0x55f8baf06660)) called at /usr/lib/perl5/vendor_perl/5.26.1/Mojo/Reactor/Poll.pm line 141
Jan 23 16:47:03 worker33 worker[125570]: Mojo::IOLoop::Stream::_read(Mojo::IOLoop::Stream=HASH(0x55f8c2611f18)) called at /usr/lib/perl5/vendor_perl/5.26.1/Mojo/IOLoop/Stream.pm line 57
Jan 23 16:47:03 worker33 worker[125570]: Mojo::IOLoop::Stream::close(Mojo::IOLoop::Stream=HASH(0x55f8c2611f18)) called at /usr/lib/perl5/vendor_perl/5.26.1/Mojo/IOLoop/Stream.pm line 109
Jan 23 16:47:03 worker33 worker[125570]: Mojo::EventEmitter::emit(Mojo::IOLoop::Stream=HASH(0x55f8c2611f18), "close") called at /usr/lib/perl5/vendor_perl/5.26.1/Mojo/IOLoop/Stream.pm line 27
Jan 23 16:47:03 worker33 worker[125570]: Mojo::IOLoop::Subprocess::__ANON__(Mojo::IOLoop::Stream=HASH(0x55f8c2611f18)) called at /usr/lib/perl5/vendor_perl/5.26.1/Mojo/EventEmitter.pm line 15
Jan 23 16:47:03 worker33 worker[125570]: OpenQA::Worker::Job::__ANON__(Mojo::IOLoop::Subprocess=HASH(0x55f8c2611ba0), "", "timeout") called at /usr/lib/perl5/vendor_perl/5.26.1/Mojo/IOLoop/Subprocess.pm line 85
Jan 23 16:47:03 worker33 worker[125570]: OpenQA::Worker::Job::_stop_step_5_1_upload(OpenQA::Worker::Job=HASH(0x55f8c08efc50), "timeout", CODE(0x55f8c2612200)) called at /usr/share/openqa/script/../lib/OpenQA/Worker/Job.pm line 476
Jan 23 16:47:03 worker33 worker[125570]: OpenQA::Worker::Job::_invoke_after_result_upload(OpenQA::Worker::Job=HASH(0x55f8c08efc50), CODE(0x55f8c260d638)) called at /usr/share/openqa/script/../lib/OpenQA/Worker/Job.pm line 488
Jan 23 16:47:03 worker33 worker[125570]: OpenQA::Worker::Job::__ANON__() called at /usr/share/openqa/script/../lib/OpenQA/Worker/Job.pm line 120
Jan 23 16:47:03 worker33 worker[125570]: OpenQA::Worker::Job::_stop_step_5_2_upload(OpenQA::Worker::Job=HASH(0x55f8c08efc50), "timeout", CODE(0x55f8c2612200)) called at /usr/share/openqa/script/../lib/OpenQA/Worker/Job.pm line 488
Jan 23 16:47:03 worker33 worker[125570]: OpenQA::Worker::Job::_upload_results(OpenQA::Worker::Job=HASH(0x55f8c08efc50), CODE(0x55f8c25ff768)) called at /usr/share/openqa/script/../lib/OpenQA/Worker/Job.pm line 533
Jan 23 16:47:03 worker33 worker[125570]: OpenQA::Worker::Job::_upload_results_step_0_prepare(OpenQA::Worker::Job=HASH(0x55f8c08efc50), CODE(0x55f8c25ff768)) called at /usr/share/openqa/script/../lib/OpenQA/Worker/Job.pm line 737
Jan 23 16:47:03 worker33 worker[125570]: OpenQA::Worker::Job::_read_result_file(OpenQA::Worker::Job=HASH(0x55f8c08efc50), "", ARRAY(0x55f8c1111500)) called at /usr/share/openqa/script/../lib/OpenQA/Worker/Job.pm line 808
Jan 23 16:47:03 worker33 worker[125570]: OpenQA::Worker::Job::_read_module_result(OpenQA::Worker::Job=HASH(0x55f8c08efc50), "boot_ltp") called at /usr/share/openqa/script/../lib/OpenQA/Worker/Job.pm line 1148
Jan 23 16:47:03 worker33 worker[125570]: OpenQA::Worker::Job::_read_json_file(OpenQA::Worker::Job=HASH(0x55f8c08efc50), "result-boot_ltp.json") called at /usr/share/openqa/script/../lib/OpenQA/Worker/Job.pm line 1186
Jan 23 16:47:03 worker33 worker[125570]: eval {...} called at /usr/share/openqa/script/../lib/OpenQA/Worker/Job.pm line 1130
Jan 23 16:47:03 worker33 worker[125570]: Mojo::File::slurp(Mojo::File=SCALAR(0x55f8c14bc4e0)) called at /usr/share/openqa/script/../lib/OpenQA/Worker/Job.pm line 1130
Jan 23 16:47:03 worker33 worker[125570]: [debug] [pid:125570] Unable to read result-boot_ltp.json: Can't open file "/var/lib/openqa/pool/14/testresults/result-boot_ltp.json": No such file or directory at /usr/lib/perl5/vendor_perl/5.26.1/Mojo/File.pm line 133.
This might be a red herring, though, and not an actual problem.
The openQA-worker package from devel_openQA is out-of-date (version 4.6.1737550093.3fc00710-lp156.7540.1 installed).
Interesting jobs found: https://openqa.suse.de/tests/16554253 (5h and still running), https://openqa.suse.de/tests/16552399 (12h and still running)
zypper in -f ./packages/devel_openQA/x86_64/openQA-worker-4.6.1737414681.5f42d4fe-lp156.7539.1.x86_64.rpm ./packages/devel_openQA/x86_64/openQA-common-4.6.1737414681.5f42d4fe-lp156.7539.1.x86_64.rpm ./packages/devel_openQA/x86_64/openQA-client-4.6.1737414681.5f42d4fe-lp156.7539.1.x86_64.rpm
This was used to revert https://github.com/os-autoinst/openQA/commit/03a92190703390cf482742817ac42b791fae265a, which was the most recent change to Job.pm, but jobs still fail with the same symptoms -> seems not related to code changes in openQA/worker.
Checking s390zl13 and specifically libvirtd (while starting the machine, VNC/dev mode stops working) I see:
s390zl13:/var/log # journalctl -fu libvirtd
Jan 20 08:06:13 s390zl13 dnsmasq[25531]: read /var/lib/libvirt/dnsmasq/default.addnhosts - 0 addresses
Jan 20 08:06:13 s390zl13 dnsmasq-dhcp[25531]: read /var/lib/libvirt/dnsmasq/default.hostsfile
Jan 20 08:33:14 s390zl13 libvirtd[179179]: g_hash_table_unref: assertion 'hash_table != NULL' failed
Jan 20 08:33:14 s390zl13 libvirtd[179179]: g_hash_table_unref: assertion 'hash_table != NULL' failed
Jan 20 08:34:14 s390zl13 libvirtd[179179]: g_hash_table_unref: assertion 'hash_table != NULL' failed
Jan 20 08:34:14 s390zl13 libvirtd[179179]: g_hash_table_unref: assertion 'hash_table != NULL' failed
Jan 20 09:23:18 s390zl13 libvirtd[179179]: g_hash_table_unref: assertion 'hash_table != NULL' failed
Jan 20 09:23:26 s390zl13 libvirtd[179179]: libvirt version: 9.0.0
Jan 20 09:23:26 s390zl13 libvirtd[179179]: hostname: s390zl13
Jan 20 09:23:26 s390zl13 libvirtd[179179]: stream had I/O failure
The g_hash_table_unref: assertion 'hash_table != NULL' failed message happened already earlier and is printed in green, so it is most likely not critical, but the last line is red and looks suspiciously related time-wise. Restarting the service fills the log with:
Jan 23 19:50:37 s390zl13 systemd[1]: Stopping Virtualization daemon...
Jan 23 19:52:07 s390zl13 systemd[1]: libvirtd.service: State 'stop-sigterm' timed out. Killing.
Jan 23 19:52:07 s390zl13 systemd[1]: libvirtd.service: Killing process 179179 (libvirtd) with signal SIGKILL.
Jan 23 19:52:07 s390zl13 systemd[1]: libvirtd.service: Main process exited, code=killed, status=9/KILL
Jan 23 19:52:07 s390zl13 systemd[1]: libvirtd.service: Failed with result 'timeout'.
Jan 23 19:52:07 s390zl13 systemd[1]: libvirtd.service: Unit process 25531 (dnsmasq) remains running after unit stopped.
Jan 23 19:52:07 s390zl13 systemd[1]: libvirtd.service: Unit process 25533 (dnsmasq) remains running after unit stopped.
Jan 23 19:52:07 s390zl13 systemd[1]: Stopped Virtualization daemon.
Jan 23 19:52:07 s390zl13 systemd[1]: Starting Virtualization daemon...
Jan 23 19:52:07 s390zl13 systemd[1]: Started Virtualization daemon.
Jan 23 19:52:08 s390zl13 dnsmasq[25531]: read /etc/hosts - 10 addresses
Jan 23 19:52:08 s390zl13 dnsmasq[25531]: read /var/lib/libvirt/dnsmasq/default.addnhosts - 0 addresses
Jan 23 19:52:08 s390zl13 dnsmasq-dhcp[25531]: read /var/lib/libvirt/dnsmasq/default.hostsfile
Jan 23 19:52:08 s390zl13 libvirtd[125601]: libvirt version: 9.0.0
Jan 23 19:52:08 s390zl13 libvirtd[125601]: hostname: s390zl13
Jan 23 19:52:08 s390zl13 libvirtd[125601]: internal error: Cannot find start time for pid 224443
Jan 23 19:52:08 s390zl13 libvirtd[125601]: internal error: Cannot find start time for pid 226044
Jan 23 19:52:08 s390zl13 libvirtd[125601]: internal error: Cannot find start time for pid 228601
Jan 23 19:52:08 s390zl13 libvirtd[125601]: internal error: Cannot find start time for pid 228611
Jan 23 19:52:08 s390zl13 libvirtd[125601]: internal error: Cannot find start time for pid 228616
Jan 23 19:52:08 s390zl13 libvirtd[125601]: internal error: Cannot find start time for pid 228684
Jan 23 19:52:08 s390zl13 libvirtd[125601]: internal error: Cannot find start time for pid 238240
Jan 23 19:52:08 s390zl13 libvirtd[125601]: internal error: Cannot find start time for pid 265699
Jan 23 19:52:08 s390zl13 libvirtd[125601]: internal error: Cannot find start time for pid 265704
Jan 23 19:52:08 s390zl13 libvirtd[125601]: internal error: Cannot find start time for pid 265719
Jan 23 19:52:08 s390zl13 libvirtd[125601]: internal error: Cannot find start time for pid 277096
Jan 23 19:52:08 s390zl13 libvirtd[125601]: internal error: Cannot find start time for pid 277159
Jan 23 19:52:08 s390zl13 libvirtd[125601]: internal error: Cannot find start time for pid 277169
Jan 23 19:52:08 s390zl13 libvirtd[125601]: End of file while reading data: Input/output error
[…]
https://openqa.suse.de/tests/16557442 and https://openqa.suse.de/tests/16557443 pass bootloader_start/bootloader_zkvm again (one of them on worker32 -> confirming again that this is not an issue in openQA/worker code).
I will try to find all affected jobs and restart them so s390 can work on the backlog overnight (and maybe crash libvirtd again, let's see).
I'm only assigning myself until all jobs are restarted, but I don't plan to implement the mentioned error-reporting improvements and would rather create a (tools-infra) ticket about investigating the failing libvirtd on s390zl13.
Updated by nicksinger about 2 months ago
- Status changed from In Progress to Workable
- Assignee deleted (nicksinger)
I've updated the worker packages on worker33 again to the most recent version and just found ~39 jobs on OSD with
select j.id from jobs j join job_modules jm on j.id = jm.job_id where t_started >= '2025-01-20T00:00:00' and j.test not like '%:investigate:%' and j.clone_id is null and jm.name like 'bootloader_start' and j.result = 'timeout_exceeded';
which I restarted. I think most were restarted automatically already or were just stuck forever, building up a queue.
Marius' query in https://progress.opensuse.org/issues/176076#note-2 indicates this could affect other hosts as well. I've only fixed zl13, and only temporarily for now. Further investigation is needed.
Updated by mkittler about 2 months ago
- Status changed from Workable to In Progress
- Assignee set to mkittler
Updated by mkittler about 2 months ago · Edited
I checked workerconf.sls and it looks like we have 2 s390x hosts:
- s390zl12.oqa.prg2.suse.org: worker31:1-10, worker32:6-15
- s390zl13.oqa.prg2.suse.org: worker32:16-25, worker33:6-15
Judging by this query, both are technically affected, but only on s390zl13 was the number of timeouted jobs significant:
openqa=> select count(jobs.id) as timeouted_job_count, min(t_finished) as first_t_finished, max(t_finished) as last_t_finished, string_agg(DISTINCT concat(jobs.ARCH, '@', jobs.MACHINE), ', ') as job_type, string_agg(DISTINCT concat(workers.host, ':', workers.instance), ', ') as worker_name, jobs.assigned_worker_id as worker_id, (CASE WHEN string_agg(DISTINCT workers.host, '' ) = 'worker31' OR (string_agg(DISTINCT workers.host, '' ) = 'worker32' AND max(workers.instance) < 16) THEN 's390zl12' ELSE 's390zl13' END) as s390x_host from jobs join workers on workers.id = jobs.assigned_worker_id where t_finished >= '2025-01-22T00:00:00' and result = 'timeout_exceeded' group by jobs.assigned_worker_id order by count(jobs.id) desc;
timeouted_job_count | first_t_finished | last_t_finished | job_type | worker_name | worker_id | s390x_host
---------------------+---------------------+---------------------+-----------------------------+--------------------+-----------+------------
18 | 2025-01-22 00:54:58 | 2025-01-23 18:02:55 | s390x@s390x-kvm | worker33:10 | 2656 | s390zl13
18 | 2025-01-22 01:18:54 | 2025-01-23 18:33:05 | s390x@s390x-kvm | worker32:23 | 2629 | s390zl13
18 | 2025-01-22 00:48:40 | 2025-01-23 18:37:23 | s390x@s390x-kvm | worker32:20 | 2626 | s390zl13
18 | 2025-01-22 00:54:17 | 2025-01-23 18:29:59 | s390x@s390x-kvm | worker33:6 | 2652 | s390zl13
18 | 2025-01-22 01:33:55 | 2025-01-23 18:38:40 | s390x@s390x-kvm | worker32:22 | 2628 | s390zl13
18 | 2025-01-22 00:30:57 | 2025-01-23 18:12:18 | s390x@s390x-kvm | worker32:16 | 2622 | s390zl13
17 | 2025-01-22 01:54:32 | 2025-01-23 16:47:03 | s390x@s390x-kvm | worker33:14 | 2660 | s390zl13
17 | 2025-01-22 00:08:51 | 2025-01-23 17:29:05 | s390x@s390x-kvm | worker32:18 | 2624 | s390zl13
17 | 2025-01-22 00:40:25 | 2025-01-23 17:18:03 | s390x@s390x-kvm | worker32:19 | 2625 | s390zl13
16 | 2025-01-22 01:03:08 | 2025-01-23 17:41:56 | s390x@s390x-kvm | worker33:9 | 2655 | s390zl13
15 | 2025-01-22 06:45:07 | 2025-01-23 18:15:26 | s390x@s390x-kvm | worker33:12 | 2658 | s390zl13
15 | 2025-01-22 00:21:12 | 2025-01-23 15:27:35 | s390x@s390x-kvm | worker33:8 | 2654 | s390zl13
15 | 2025-01-22 03:02:01 | 2025-01-23 17:45:00 | s390x@s390x-kvm | worker32:21 | 2627 | s390zl13
14 | 2025-01-22 01:50:01 | 2025-01-23 19:11:51 | s390x@s390x-kvm | worker33:13 | 2659 | s390zl13
14 | 2025-01-22 00:28:29 | 2025-01-23 16:55:07 | s390x@s390x-kvm | worker32:24 | 2630 | s390zl13
13 | 2025-01-22 03:02:02 | 2025-01-23 14:04:41 | s390x@s390x-kvm | worker32:17 | 2623 | s390zl13
12 | 2025-01-22 03:56:07 | 2025-01-23 17:39:20 | s390x@s390x-kvm | worker33:7 | 2653 | s390zl13
12 | 2025-01-22 05:07:03 | 2025-01-23 07:20:31 | s390x@s390x-kvm | worker32:25 | 2631 | s390zl13
11 | 2025-01-22 03:02:08 | 2025-01-23 17:39:20 | s390x@s390x-kvm | worker33:11 | 2657 | s390zl13
10 | 2025-01-22 03:51:46 | 2025-01-23 07:22:01 | s390x@s390x-kvm | worker33:15 | 2661 | s390zl13
3 | 2025-01-22 02:32:59 | 2025-01-22 15:45:22 | s390x@s390x-kvm | worker32:8 | 3866 | s390zl12
2 | 2025-01-23 04:31:55 | 2025-01-23 10:04:17 | s390x@s390x-kvm | worker32:9 | 3868 | s390zl12
1 | 2025-01-22 12:13:04 | 2025-01-22 12:13:04 | x86_64@uefi-staging | worker32:60 | 3885 | s390zl13
1 | 2025-01-22 00:22:32 | 2025-01-22 00:22:32 | x86_64@64bit | worker34:57 | 3897 | s390zl13
1 | 2025-01-22 13:46:28 | 2025-01-22 13:46:28 | x86_64@64bit | worker34:63 | 3921 | s390zl13
1 | 2025-01-22 19:13:23 | 2025-01-22 19:13:23 | x86_64@64bit-4gbram | worker35:59 | 3946 | s390zl13
1 | 2025-01-22 03:58:33 | 2025-01-22 03:58:33 | aarch64@ipmi-64bit-unarmed | sapworker1:41 | 3957 | s390zl13
1 | 2025-01-24 01:21:01 | 2025-01-24 01:21:01 | x86_64@svirt-hyperv | qesapworker-prg4:2 | 2408 | s390zl13
1 | 2025-01-22 15:48:47 | 2025-01-22 15:48:47 | x86_64@64bit-ipmi-large-mem | worker36:63 | 3993 | s390zl13
1 | 2025-01-22 23:09:56 | 2025-01-22 23:09:56 | s390x@s390x-kvm | worker31:9 | 2575 | s390zl12
1 | 2025-01-22 07:26:02 | 2025-01-22 07:26:02 | s390x@s390x-kvm | worker32:11 | 2617 | s390zl12
1 | 2025-01-24 10:05:06 | 2025-01-24 10:05:06 | x86_64@64bit | worker34:27 | 2713 | s390zl13
1 | 2025-01-23 12:05:37 | 2025-01-23 12:05:37 | x86_64@64bit-4gbram | worker35:19 | 2745 | s390zl13
1 | 2025-01-24 10:07:02 | 2025-01-24 10:07:02 | x86_64@uefi | worker35:37 | 2763 | s390zl13
1 | 2025-01-23 09:43:28 | 2025-01-23 09:43:28 | x86_64@uefi | worker36:37 | 2803 | s390zl13
1 | 2025-01-22 01:43:01 | 2025-01-22 01:43:01 | aarch64@aarch64-virtio | worker-arm1:21 | 3027 | s390zl13
1 | 2025-01-22 02:25:27 | 2025-01-22 02:25:27 | aarch64@aarch64-virtio | worker-arm2:1 | 3047 | s390zl13
1 | 2025-01-22 02:46:59 | 2025-01-22 02:46:59 | aarch64@aarch64-virtio | worker-arm2:4 | 3050 | s390zl13
1 | 2025-01-22 02:14:32 | 2025-01-22 02:14:32 | aarch64@aarch64-virtio | worker-arm2:8 | 3054 | s390zl13
1 | 2025-01-22 02:14:57 | 2025-01-22 02:14:57 | aarch64@aarch64-virtio | worker-arm2:22 | 3068 | s390zl13
1 | 2025-01-23 15:10:56 | 2025-01-23 15:10:56 | x86_64@64bit | worker40:9 | 3095 | s390zl13
1 | 2025-01-24 09:05:03 | 2025-01-24 09:05:03 | x86_64@uefi | worker30:48 | 3233 | s390zl13
1 | 2025-01-23 16:10:55 | 2025-01-23 16:10:55 | x86_64@uefi | worker33:43 | 3334 | s390zl13
1 | 2025-01-24 10:18:53 | 2025-01-24 10:18:53 | x86_64@64bit | worker33:45 | 3339 | s390zl13
1 | 2025-01-23 12:37:28 | 2025-01-23 12:37:28 | x86_64@uefi | worker33:50 | 3433 | s390zl13
1 | 2025-01-23 19:11:31 | 2025-01-23 19:11:31 | x86_64@64bit-ipmi-large-mem | worker35:51 | 3514 | s390zl13
1 | 2025-01-22 01:14:22 | 2025-01-22 01:14:22 | x86_64@64bit | worker31:55 | 3561 | s390zl12
1 | 2025-01-22 00:32:39 | 2025-01-22 00:32:39 | x86_64@64bit | worker31:56 | 3562 | s390zl12
1 | 2025-01-23 19:08:06 | 2025-01-23 19:08:06 | s390x@s390x-kvm | worker32:6 | 3867 | s390zl12
1 | 2025-01-23 08:47:33 | 2025-01-23 08:47:33 | s390x@s390x-kvm | worker32:10 | 3869 | s390zl12
(50 rows)
Nick has already checked the "bad" host s390zl13, so I am now checking the "good" host s390zl12 for comparison.
I found the same g_hash_table_unref: assertion 'hash_table != NULL' failed warning, so it is probably a red herring, indeed. Restarting the service worked, but the service had been restarted 48 min ago anyway. There are no errors like internal error: Cannot find start time for pid … in the whole journal, so this might indeed be the culprit.
Updated by mkittler about 2 months ago
- Priority changed from Urgent to High
Not sure how to debug this IBM-specific device. Maybe we can consider it a hopefully non-repeating issue with that disk and only investigate the problem further if it happens again.
Updated by openqa_review about 2 months ago
- Due date set to 2025-02-08
Setting due date based on mean cycle time of SUSE QE Tools
Updated by okurz about 2 months ago
mkittler wrote in #note-8:
Not sure how to debug this IBM-specific device. Maybe we can consider it a hopefully non-repeating issue with that disk and only investigate the problem further if it happens again.
I agree. Let's assume the infra-related part is done and consider AC2 covered. Please focus on AC1, i.e. the dev-related improvement of ensuring that there is proper error feedback well before MAX_JOB_TIME.
Updated by mkittler about 2 months ago · Edited
About not running into a timeout sooner:
Maybe the command virsh freecell --all was itself hanging. Otherwise the SSH code in os-autoinst must be broken because it is already supposed to enforce a timeout of 5 minutes.
The test code looks like this:
record_info('free -h', $svirt->get_cmd_output('free -h'));
record_info('virsh freecell --all', $svirt->get_cmd_output('virsh freecell --all'));
record_info('virsh domstats', $svirt->get_cmd_output('virsh domstats'));
The first SSH command works (free -h) and it only fails on the first virsh … command. This speaks for the theory that virsh … itself is hanging and not the connection.
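To check that theory directly, a quick manual probe could look like the following sketch. It only loosely mirrors what os-autoinst does; the host, the root/agent authentication, the 30-second timeout and the command list are assumptions for illustration.
#!/usr/bin/perl
# Hedged sketch of a manual check: run the same two commands with a short
# Net::SSH2 timeout to see whether `virsh` itself stalls or the connection
# does. Host, user, timeout and commands are assumptions for illustration.
use strict;
use warnings;
use Net::SSH2;

my $host    = 's390zl13.oqa.prg2.suse.org';    # assumed target host
my $timeout = 30 * 1000;                       # 30 s in ms, far below the 5 min backend timeout

my $ssh2 = Net::SSH2->new(timeout => $timeout);
$ssh2->connect($host) or $ssh2->die_with_error;
$ssh2->auth_agent('root') or $ssh2->die_with_error;

for my $cmd ('free -h', 'virsh freecell --all') {
    my $chan = $ssh2->channel() or $ssh2->die_with_error;
    $chan->exec($cmd) or $ssh2->die_with_error;
    my $out = '';
    until ($chan->eof) {
        # If the command hangs without producing output, this read fails once
        # the 30 s timeout expires; a broken connection would fail earlier.
        if (my ($o, $e) = $chan->read2) { $out .= $o . $e }
        else                            { $ssh2->die_with_error }
    }
    printf "--- %s (exit %d) ---\n%s\n", $cmd, $chan->exit_status, $out;
    $chan->close;
}
If virsh is the culprit, the second iteration should abort with a timeout error while free -h completes normally.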
In the post fail hook it also runs a virsh … command, then complains that the SSH connection is broken and then (successfully) re-establishes it:
[2025-01-23T00:13:33.300023Z] [debug] [pid:118188] <<< backend::baseclass::run_ssh_cmd(cmd="virsh destroy openQA-SUT-9", wantarray=0, keep_open=1)
[2025-01-23T00:13:33.300049Z] [debug] [pid:118155] isotovideo handle_shutdown
[2025-01-23T00:13:33.300124Z] [debug] [pid:118188] <<< backend::baseclass::run_ssh(cmd="virsh destroy openQA-SUT-9", keep_open=1, wantarray=0)
[2025-01-23T00:13:33.300202Z] [debug] [pid:118155] stopping command server 118171 because test execution ended
[2025-01-23T00:13:33.300216Z] [debug] [pid:118188] <<< backend::baseclass::new_ssh_connection(keep_open=1, blocking=1, wantarray=0)
[2025-01-23T00:13:33.300289Z] [debug] [pid:118155] isotovideo: informing websocket clients before stopping command server: http://127.0.0.1:20143/oNGCAPETeF0q9hN6/broadcast
[2025-01-23T00:13:33.300356Z] [debug] [pid:118188] Closing broken SSH connection (key:hostname=s390zl13.oqa.prg2.suse.org,username=root,port=22)
…
[2025-01-23T00:13:33.344307Z] [debug] [pid:118188] SSH connection to root@s390zl13.oqa.prg2.suse.org established
The "Closing broken SSH connection" means we cannot create a new channel via the SSH connection. So it cannot be reused and will be re-established which is actually even happening. So maybe it was the SSH connection after all?
Updated by mkittler about 2 months ago · Edited
I created a small test script which simulates the behavior of the SSH code in os-autoinst in case a command takes very long to produce any output:
#!/bin/perl
use Mojo::Base -strict, -signatures;
use Net::SSH2;

my $ssh2 = Net::SSH2->new(timeout => 5 * 1000);
$ssh2->connect('localhost') or $ssh2->die_with_error;
$ssh2->check_hostkey('ask') or $ssh2->die_with_error;
$ssh2->auth_agent($ENV{USER}) or $ssh2->die_with_error;
my $chan = $ssh2->channel() or $ssh2->die_with_error;
print "BEFORE EXEC\n";
#$ssh2->blocking(0);
$chan->exec('sleep 10; ls') or $ssh2->die_with_error;
print "AFTER EXEC\n";
$chan->send_eof;
#print while <$chan>;
print "AFTER SEND_EOF\n";
my ($stdout, $stderr) = ('', '');
until ($chan->eof) {
    if (my ($o, $e) = $chan->read2) {
        $stdout .= $o;
        $stderr .= $e;
    }
    else {
        $ssh2->die_with_error;
    }
}
print "AFTER UNTIL\n";
print "EXIT CODE: ", $chan->exit_status, "\n";
print "STDOUT: $stdout\n";
print "STDERR: $stderr\n";
$chan->close;
This works after adding the else block in the until-loop, in the sense that the script then returns in under 10 seconds despite the sleep. So maybe it makes sense to add this kind of error handling in os-autoinst as well. Note that exec itself always returns fast (presumably as fast as the command can be sent over the network); it is not waiting for the command to complete (and thus not potentially running into a timeout due to a long-running command).
This of course means our Net::SSH2 timeout will apply to individual reads. So if a command continuously produces output, the code will still not run into a timeout. However, I suppose that's also what we want. I still hope this doesn't break any use cases, as 5 minutes is not that long (e.g. a copy operation of a large file over the network might take longer but not produce any output while it is ongoing).
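To make that distinction concrete, here is a small sketch, not the merged os-autoinst change; the helper name and the overall_timeout option are illustrative assumptions. It shows a read loop where the Net::SSH2 timeout bounds each individual read2 call, with an optional overall deadline on top that would additionally cap commands that keep producing output.
#!/usr/bin/perl
# Sketch only: per-read timeout (what Net::SSH2's `timeout` option gives us)
# versus an optional overall deadline. Helper name and option are illustrative.
use strict;
use warnings;
use Time::HiRes qw(time);

sub read_all_output {
    my ($ssh2, $chan, %opt) = @_;
    my $deadline = $opt{overall_timeout} ? time + $opt{overall_timeout} : undef;
    my ($stdout, $stderr) = ('', '');
    until ($chan->eof) {
        # Each read2 call is bounded by the Net::SSH2 timeout: a silently
        # hanging command fails here, a chatty long-runner never does.
        if (my ($o, $e) = $chan->read2) {
            $stdout .= $o;
            $stderr .= $e;
        }
        else {
            $ssh2->die_with_error;
        }
        # Only an overall deadline would also cap chatty long-running commands.
        die "overall timeout exceeded\n" if $deadline && time > $deadline;
    }
    return ($stdout, $stderr);
}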
Updated by mkittler about 2 months ago
- Status changed from In Progress to Feedback
Updated by mkittler about 2 months ago
The PR has been merged and I also put the s390x worker I used for testing back into production. (And I created https://gitlab.suse.de/openqa/salt-states-openqa/-/merge_requests/1357 to make this easier in the future.)
Updated by okurz about 2 months ago
- Due date deleted (2025-02-08)
- Status changed from Feedback to Resolved
Tests in production look good. As decided in the daily we are good to resolve.
Updated by mkittler about 2 months ago
- Status changed from In Progress to Feedback
This PR should fix the error handling: https://github.com/os-autoinst/os-autoinst/pull/2631
Updated by okurz about 2 months ago
- Status changed from Feedback to In Progress
Updated by mkittler about 2 months ago
- Subject changed from [tools] tests run into MAX_JOB_TIME exceeded in bootloader_zkvm, unable to login over ssh? to [tools] tests run into MAX_JOB_TIME exceeded in bootloader_zkvm, unable to login over ssh? size:M
Updated by mkittler about 2 months ago
Another fixup PR: https://github.com/os-autoinst/os-autoinst/pull/2635
Updated by mkittler about 2 months ago
- Status changed from In Progress to Feedback
Updated by mkittler about 1 month ago
The latest version was deployed this morning but no new tests have been scheduled, so I created https://openqa.suse.de/tests/16624497 (openqa-clone-job --skip-download --within-instance https://openqa.suse.de/tests/16620303 _GROUP=0 {BUILD,TEST}+=-test-poo176076 PUBLISH_HDD_1= PUBLISH_PFLASH_VARS=).
Updated by mkittler about 1 month ago
- Status changed from Feedback to Resolved
The test succeeded. With that the ACs are fulfilled so I am considering this ticket resolved.
Updated by nicksinger about 1 month ago
- Related to action #176868: Random SSH failures on s390x workers size:S added
Updated by tinita about 1 month ago
We are seeing the error handling now in some tests: #176868
https://openqa.suse.de/tests/16701073/logfile?filename=autoinst-log.txt#line-922
[2025-02-08T01:51:13.557920Z] [debug] [pid:126055] [run_ssh_cmd(virsh undefine --snapshots-metadata openQA-SUT-8)] stderr:
error: failed to get domain 'openQA-SUT-8'
[2025-02-08T01:51:13.558019Z] [debug] [pid:126055] [run_ssh_cmd(virsh undefine --snapshots-metadata openQA-SUT-8)] exit-code: 1