action #176076
closed
openQA Project (public) - coordination #176337: [saga][epic] Stable os-autoinst backends with stable command execution (no mistyping)
openQA Project (public) - coordination #125708: [epic] Future ideas for more stable non-qemu backends
[tools] tests run into MAX_JOB_TIME exceeded in bootloader_zkvm, unable to login over ssh? size:M
0%
Description
Observation
openQA test in scenario sle-15-SP6-BCI-Updates-s390x-golang_stable_on_SLES_15-SP4_podman@s390x-kvm run into MAX_JOB_TIME exceeded in
bootloader_zkvm
This was brought up in Slack #eng-testing by @MDoucha in https://suse.slack.com/archives/C02CANHLANP/p1737624344362149
Reproducible
Seems to fail consistently looking at the worker history, e.g. https://openqa.suse.de/admin/workers/2660
Expected result
Last good: https://openqa.suse.de/tests/16506528 from 2025-01-20 09:14:53Z
Acceptance criteria
- AC1: Clear error output well before MAX_JOB_TIME about the actual cause
- AC2: Consistently stable tests again on s390x-kvm within the OSD infrastructure
Suggestions
- Try to reproduce manually or using developer mode
- Trigger investigation jobs with openqa-investigate to bisect between impact of product, tests, infrastructure
- Gather statistics, e.g. using openqa-clone-job --repeat
- Ensure that the error output is clear and helps to distinguish between test issues and infrastructure issues
- Fix the actual underlying problem
Further details
Always latest result in this scenario: latest
While this is a test on OSD, so far it looks like a generic openQA s390x problem, so not assigning with tag "infra"
Updated by mkittler about 2 months ago
Looks like this happens across different worker slots and hosts:
openqa=> select count(jobs.id) as timeouted_job_count, min(t_finished) as first_t_finished, max(t_finished) as last_t_finished, string_agg(DISTINCT concat(jobs.ARCH, '@', jobs.MACHINE), ', ') as job_type, string_agg(DISTINCT concat(workers.host, ':', workers.instance), ', ') as worker_name, jobs.assigned_worker_id as worker_id from jobs join workers on workers.id = jobs.assigned_worker_id where t_finished >= '2025-01-22T00:00:00' and result = 'timeout_exceeded' group by jobs.assigned_worker_id order by count(jobs.id) desc limit 50;
timeouted_job_count | first_t_finished | last_t_finished | job_type | worker_name | worker_id
---------------------+---------------------+---------------------+-----------------------------+----------------+-----------
17 | 2025-01-22 01:54:32 | 2025-01-23 16:47:03 | s390x@s390x-kvm | worker33:14 | 2660
17 | 2025-01-22 00:30:57 | 2025-01-23 16:11:06 | s390x@s390x-kvm | worker32:16 | 2622
17 | 2025-01-22 00:48:40 | 2025-01-23 15:57:33 | s390x@s390x-kvm | worker32:20 | 2626
17 | 2025-01-22 01:33:55 | 2025-01-23 16:07:49 | s390x@s390x-kvm | worker32:22 | 2628
17 | 2025-01-22 01:18:54 | 2025-01-23 16:01:46 | s390x@s390x-kvm | worker32:23 | 2629
17 | 2025-01-22 00:54:17 | 2025-01-23 15:28:39 | s390x@s390x-kvm | worker33:6 | 2652
17 | 2025-01-22 00:54:58 | 2025-01-23 16:01:42 | s390x@s390x-kvm | worker33:10 | 2656
16 | 2025-01-22 00:40:25 | 2025-01-23 14:11:44 | s390x@s390x-kvm | worker32:19 | 2625
16 | 2025-01-22 00:08:51 | 2025-01-23 15:24:44 | s390x@s390x-kvm | worker32:18 | 2624
15 | 2025-01-22 01:03:08 | 2025-01-23 14:17:59 | s390x@s390x-kvm | worker33:9 | 2655
15 | 2025-01-22 00:21:12 | 2025-01-23 15:27:35 | s390x@s390x-kvm | worker33:8 | 2654
14 | 2025-01-22 00:28:29 | 2025-01-23 16:55:07 | s390x@s390x-kvm | worker32:24 | 2630
14 | 2025-01-22 03:02:01 | 2025-01-23 14:43:14 | s390x@s390x-kvm | worker32:21 | 2627
14 | 2025-01-22 06:45:07 | 2025-01-23 16:09:15 | s390x@s390x-kvm | worker33:12 | 2658
13 | 2025-01-22 01:50:01 | 2025-01-23 16:48:13 | s390x@s390x-kvm | worker33:13 | 2659
13 | 2025-01-22 03:02:02 | 2025-01-23 14:04:41 | s390x@s390x-kvm | worker32:17 | 2623
12 | 2025-01-22 05:07:03 | 2025-01-23 07:20:31 | s390x@s390x-kvm | worker32:25 | 2631
11 | 2025-01-22 03:56:07 | 2025-01-23 12:20:01 | s390x@s390x-kvm | worker33:7 | 2653
10 | 2025-01-22 03:51:46 | 2025-01-23 07:22:01 | s390x@s390x-kvm | worker33:15 | 2661
10 | 2025-01-22 03:02:08 | 2025-01-23 14:28:03 | s390x@s390x-kvm | worker33:11 | 2657
3 | 2025-01-22 02:32:59 | 2025-01-22 15:45:22 | s390x@s390x-kvm | worker32:8 | 3866
2 | 2025-01-23 04:31:55 | 2025-01-23 10:04:17 | s390x@s390x-kvm | worker32:9 | 3868
1 | 2025-01-23 12:05:37 | 2025-01-23 12:05:37 | x86_64@64bit-4gbram | worker35:19 | 2745
1 | 2025-01-23 09:43:28 | 2025-01-23 09:43:28 | x86_64@uefi | worker36:37 | 2803
1 | 2025-01-22 01:43:01 | 2025-01-22 01:43:01 | aarch64@aarch64-virtio | worker-arm1:21 | 3027
1 | 2025-01-22 02:25:27 | 2025-01-22 02:25:27 | aarch64@aarch64-virtio | worker-arm2:1 | 3047
1 | 2025-01-22 02:46:59 | 2025-01-22 02:46:59 | aarch64@aarch64-virtio | worker-arm2:4 | 3050
1 | 2025-01-22 02:14:32 | 2025-01-22 02:14:32 | aarch64@aarch64-virtio | worker-arm2:8 | 3054
1 | 2025-01-22 02:14:57 | 2025-01-22 02:14:57 | aarch64@aarch64-virtio | worker-arm2:22 | 3068
1 | 2025-01-23 15:10:56 | 2025-01-23 15:10:56 | x86_64@64bit | worker40:9 | 3095
1 | 2025-01-23 16:10:55 | 2025-01-23 16:10:55 | x86_64@uefi | worker33:43 | 3334
1 | 2025-01-23 12:37:28 | 2025-01-23 12:37:28 | x86_64@uefi | worker33:50 | 3433
1 | 2025-01-22 01:14:22 | 2025-01-22 01:14:22 | x86_64@64bit | worker31:55 | 3561
1 | 2025-01-22 00:32:39 | 2025-01-22 00:32:39 | x86_64@64bit | worker31:56 | 3562
1 | 2025-01-23 08:47:33 | 2025-01-23 08:47:33 | s390x@s390x-kvm | worker32:10 | 3869
1 | 2025-01-22 12:13:04 | 2025-01-22 12:13:04 | x86_64@uefi-staging | worker32:60 | 3885
1 | 2025-01-22 00:22:32 | 2025-01-22 00:22:32 | x86_64@64bit | worker34:57 | 3897
1 | 2025-01-22 13:46:28 | 2025-01-22 13:46:28 | x86_64@64bit | worker34:63 | 3921
1 | 2025-01-22 19:13:23 | 2025-01-22 19:13:23 | x86_64@64bit-4gbram | worker35:59 | 3946
1 | 2025-01-22 03:58:33 | 2025-01-22 03:58:33 | aarch64@ipmi-64bit-unarmed | sapworker1:41 | 3957
1 | 2025-01-22 07:26:02 | 2025-01-22 07:26:02 | s390x@s390x-kvm | worker32:11 | 2617
1 | 2025-01-22 15:48:47 | 2025-01-22 15:48:47 | x86_64@64bit-ipmi-large-mem | worker36:63 | 3993
1 | 2025-01-22 23:09:56 | 2025-01-22 23:09:56 | s390x@s390x-kvm | worker31:9 | 2575
(43 rows)
Updated by nicksinger about 2 months ago
- Status changed from New to In Progress
- Assignee set to nicksinger
Started ~3 days ago (2025-01-20 11:00 UTC).
Worker at time of "hanging" executes:
CGroup: /openqa.slice/openqa-worker.slice/openqa-worker-auto-restart@14.service
├─112489 perl /usr/bin/isotovideo -d
├─112502 "/usr/bin/isotovideo: comman"
├─112515 "/usr/bin/isotovideo: autote"
├─112516 "/usr/bin/isotovideo: backen"
├─112897 /usr/lib/os-autoinst/videoencoder /var/lib/openqa/pool/14/video.ogv -n -x 1024 -y 768
├─112962 Xvnc -depth 16 -inetd -SecurityTypes None -ac :54025
├─112967 icewm -c /usr/lib/os-autoinst/consoles/icewm.cfg
├─112968 xterm -fn eurlatgr -fullscreen +sb -b 0 -bc -uc -bcf 200 -bcn 200 +bdc -xrm "xterm*boldMode: false" -xrm "xterm*foreground: rgb:aa/aa/aa" -xrm "xterm*background: rgb:00/00/00" -xrm "xterm*color0: rgb:00/00/00" -xrm "xterm*color1: rgb:aa/00/00" -xrm "xterm*color2: rgb:00/aa/00" -xrm "xterm*color3:>
├─112969 bash -c "TERM=xterm ssh -o TCPKeepAlive=yes -o ServerAliveCountMax=480 -o ServerAliveInterval=60 -o UserKnownHostsFile=/dev/null -o StrictHostKeyChecking=no -o PubkeyAuthentication=no root@s390zl13.oqa.prg2.suse.org; read"
├─112970 ssh -o TCPKeepAlive=yes -o ServerAliveCountMax=480 -o ServerAliveInterval=60 -o UserKnownHostsFile=/dev/null -o StrictHostKeyChecking=no -o PubkeyAuthentication=no root@s390zl13.oqa.prg2.suse.org
└─125570 /usr/bin/perl /usr/share/openqa/script/worker --instance 14
Log on worker33 contains a lot of:
Jan 23 16:47:03 worker33 worker[125570]: [debug] [pid:125570] REST-API call: POST "http://openqa.suse.de/api/v1/jobs/16554787/status"
Jan 23 16:47:03 worker33 worker[125570]: OpenQA::Worker::exec(OpenQA::Worker=HASH(0x55f8be2eb290)) called at /usr/share/openqa/script/worker line 120
Jan 23 16:47:03 worker33 worker[125570]: Mojo::IOLoop::start(Mojo::IOLoop=HASH(0x55f8bc83cd08)) called at /usr/share/openqa/script/../lib/OpenQA/Worker.pm line 369
Jan 23 16:47:03 worker33 worker[125570]: Mojo::Reactor::Poll::start(Mojo::Reactor::Poll=HASH(0x55f8baf06660)) called at /usr/lib/perl5/vendor_perl/5.26.1/Mojo/IOLoop.pm line 134
Jan 23 16:47:03 worker33 worker[125570]: Mojo::Reactor::Poll::one_tick(Mojo::Reactor::Poll=HASH(0x55f8baf06660)) called at /usr/lib/perl5/vendor_perl/5.26.1/Mojo/Reactor/Poll.pm line 101
Jan 23 16:47:03 worker33 worker[125570]: Mojo::Reactor::Poll::_try(Mojo::Reactor::Poll=HASH(0x55f8baf06660), "I/O watcher", CODE(0x55f8c260f7a8), 0) called at /usr/lib/perl5/vendor_perl/5.26.1/Mojo/Reactor/Poll.pm line 60
Jan 23 16:47:03 worker33 worker[125570]: eval {...} called at /usr/lib/perl5/vendor_perl/5.26.1/Mojo/Reactor/Poll.pm line 141
Jan 23 16:47:03 worker33 worker[125570]: Mojo::IOLoop::Stream::__ANON__(Mojo::Reactor::Poll=HASH(0x55f8baf06660)) called at /usr/lib/perl5/vendor_perl/5.26.1/Mojo/Reactor/Poll.pm line 141
Jan 23 16:47:03 worker33 worker[125570]: Mojo::IOLoop::Stream::_read(Mojo::IOLoop::Stream=HASH(0x55f8c2611f18)) called at /usr/lib/perl5/vendor_perl/5.26.1/Mojo/IOLoop/Stream.pm line 57
Jan 23 16:47:03 worker33 worker[125570]: Mojo::IOLoop::Stream::close(Mojo::IOLoop::Stream=HASH(0x55f8c2611f18)) called at /usr/lib/perl5/vendor_perl/5.26.1/Mojo/IOLoop/Stream.pm line 109
Jan 23 16:47:03 worker33 worker[125570]: Mojo::EventEmitter::emit(Mojo::IOLoop::Stream=HASH(0x55f8c2611f18), "close") called at /usr/lib/perl5/vendor_perl/5.26.1/Mojo/IOLoop/Stream.pm line 27
Jan 23 16:47:03 worker33 worker[125570]: Mojo::IOLoop::Subprocess::__ANON__(Mojo::IOLoop::Stream=HASH(0x55f8c2611f18)) called at /usr/lib/perl5/vendor_perl/5.26.1/Mojo/EventEmitter.pm line 15
Jan 23 16:47:03 worker33 worker[125570]: OpenQA::Worker::Job::__ANON__(Mojo::IOLoop::Subprocess=HASH(0x55f8c2611ba0), "", "timeout") called at /usr/lib/perl5/vendor_perl/5.26.1/Mojo/IOLoop/Subprocess.pm line 85
Jan 23 16:47:03 worker33 worker[125570]: OpenQA::Worker::Job::_stop_step_5_1_upload(OpenQA::Worker::Job=HASH(0x55f8c08efc50), "timeout", CODE(0x55f8c2612200)) called at /usr/share/openqa/script/../lib/OpenQA/Worker/Job.pm line 476
Jan 23 16:47:03 worker33 worker[125570]: OpenQA::Worker::Job::_invoke_after_result_upload(OpenQA::Worker::Job=HASH(0x55f8c08efc50), CODE(0x55f8c260d638)) called at /usr/share/openqa/script/../lib/OpenQA/Worker/Job.pm line 488
Jan 23 16:47:03 worker33 worker[125570]: OpenQA::Worker::Job::__ANON__() called at /usr/share/openqa/script/../lib/OpenQA/Worker/Job.pm line 120
Jan 23 16:47:03 worker33 worker[125570]: OpenQA::Worker::Job::_stop_step_5_2_upload(OpenQA::Worker::Job=HASH(0x55f8c08efc50), "timeout", CODE(0x55f8c2612200)) called at /usr/share/openqa/script/../lib/OpenQA/Worker/Job.pm line 488
Jan 23 16:47:03 worker33 worker[125570]: OpenQA::Worker::Job::_upload_results(OpenQA::Worker::Job=HASH(0x55f8c08efc50), CODE(0x55f8c25ff768)) called at /usr/share/openqa/script/../lib/OpenQA/Worker/Job.pm line 533
Jan 23 16:47:03 worker33 worker[125570]: OpenQA::Worker::Job::_upload_results_step_0_prepare(OpenQA::Worker::Job=HASH(0x55f8c08efc50), CODE(0x55f8c25ff768)) called at /usr/share/openqa/script/../lib/OpenQA/Worker/Job.pm line 737
Jan 23 16:47:03 worker33 worker[125570]: OpenQA::Worker::Job::_read_result_file(OpenQA::Worker::Job=HASH(0x55f8c08efc50), "", ARRAY(0x55f8c1111500)) called at /usr/share/openqa/script/../lib/OpenQA/Worker/Job.pm line 808
Jan 23 16:47:03 worker33 worker[125570]: OpenQA::Worker::Job::_read_module_result(OpenQA::Worker::Job=HASH(0x55f8c08efc50), "boot_ltp") called at /usr/share/openqa/script/../lib/OpenQA/Worker/Job.pm line 1148
Jan 23 16:47:03 worker33 worker[125570]: OpenQA::Worker::Job::_read_json_file(OpenQA::Worker::Job=HASH(0x55f8c08efc50), "result-boot_ltp.json") called at /usr/share/openqa/script/../lib/OpenQA/Worker/Job.pm line 1186
Jan 23 16:47:03 worker33 worker[125570]: eval {...} called at /usr/share/openqa/script/../lib/OpenQA/Worker/Job.pm line 1130
Jan 23 16:47:03 worker33 worker[125570]: Mojo::File::slurp(Mojo::File=SCALAR(0x55f8c14bc4e0)) called at /usr/share/openqa/script/../lib/OpenQA/Worker/Job.pm line 1130
Jan 23 16:47:03 worker33 worker[125570]: [debug] [pid:125570] Unable to read result-boot_ltp.json: Can't open file "/var/lib/openqa/pool/14/testresults/result-boot_ltp.json": No such file or directory at /usr/lib/perl5/vendor_perl/5.26.1/Mojo/File.pm line 133.
This might be a red herring, though, and not an actual problem.
The openQA-worker package from devel_openQA is out-of-date (version 4.6.1737550093.3fc00710-lp156.7540.1 installed).
Interesting jobs found: https://openqa.suse.de/tests/16554253 (5h and still running), https://openqa.suse.de/tests/16552399 (12h and still running)
zypper in -f ./packages/devel_openQA/x86_64/openQA-worker-4.6.1737414681.5f42d4fe-lp156.7539.1.x86_64.rpm ./packages/devel_openQA/x86_64/openQA-common-4.6.1737414681.5f42d4fe-lp156.7539.1.x86_64.rpm ./packages/devel_openQA/x86_64/openQA-client-4.6.1737414681.5f42d4fe-lp156.7539.1.x86_64.rpm
This was used to revert https://github.com/os-autoinst/openQA/commit/03a92190703390cf482742817ac42b791fae265a, which was the most recent change to Job.pm, but jobs still fail with the same symptoms -> seems not related to code changes in openQA/worker.
Checking s390zl13 and specifically libvirtd (while starting the machine, VNC/dev mode stops working) I see:
s390zl13:/var/log # journalctl -fu libvirtd
Jan 20 08:06:13 s390zl13 dnsmasq[25531]: read /var/lib/libvirt/dnsmasq/default.addnhosts - 0 addresses
Jan 20 08:06:13 s390zl13 dnsmasq-dhcp[25531]: read /var/lib/libvirt/dnsmasq/default.hostsfile
Jan 20 08:33:14 s390zl13 libvirtd[179179]: g_hash_table_unref: assertion 'hash_table != NULL' failed
Jan 20 08:33:14 s390zl13 libvirtd[179179]: g_hash_table_unref: assertion 'hash_table != NULL' failed
Jan 20 08:34:14 s390zl13 libvirtd[179179]: g_hash_table_unref: assertion 'hash_table != NULL' failed
Jan 20 08:34:14 s390zl13 libvirtd[179179]: g_hash_table_unref: assertion 'hash_table != NULL' failed
Jan 20 09:23:18 s390zl13 libvirtd[179179]: g_hash_table_unref: assertion 'hash_table != NULL' failed
Jan 20 09:23:26 s390zl13 libvirtd[179179]: libvirt version: 9.0.0
Jan 20 09:23:26 s390zl13 libvirtd[179179]: hostname: s390zl13
Jan 20 09:23:26 s390zl13 libvirtd[179179]: stream had I/O failure
The g_hash_table_unref: assertion 'hash_table != NULL' failed message happened already earlier and is printed in green, so it is most likely not critical, but the last line is red and looks suspiciously related time-wise. Restarting the service fills the log with:
Jan 23 19:50:37 s390zl13 systemd[1]: Stopping Virtualization daemon...
Jan 23 19:52:07 s390zl13 systemd[1]: libvirtd.service: State 'stop-sigterm' timed out. Killing.
Jan 23 19:52:07 s390zl13 systemd[1]: libvirtd.service: Killing process 179179 (libvirtd) with signal SIGKILL.
Jan 23 19:52:07 s390zl13 systemd[1]: libvirtd.service: Main process exited, code=killed, status=9/KILL
Jan 23 19:52:07 s390zl13 systemd[1]: libvirtd.service: Failed with result 'timeout'.
Jan 23 19:52:07 s390zl13 systemd[1]: libvirtd.service: Unit process 25531 (dnsmasq) remains running after unit stopped.
Jan 23 19:52:07 s390zl13 systemd[1]: libvirtd.service: Unit process 25533 (dnsmasq) remains running after unit stopped.
Jan 23 19:52:07 s390zl13 systemd[1]: Stopped Virtualization daemon.
Jan 23 19:52:07 s390zl13 systemd[1]: Starting Virtualization daemon...
Jan 23 19:52:07 s390zl13 systemd[1]: Started Virtualization daemon.
Jan 23 19:52:08 s390zl13 dnsmasq[25531]: read /etc/hosts - 10 addresses
Jan 23 19:52:08 s390zl13 dnsmasq[25531]: read /var/lib/libvirt/dnsmasq/default.addnhosts - 0 addresses
Jan 23 19:52:08 s390zl13 dnsmasq-dhcp[25531]: read /var/lib/libvirt/dnsmasq/default.hostsfile
Jan 23 19:52:08 s390zl13 libvirtd[125601]: libvirt version: 9.0.0
Jan 23 19:52:08 s390zl13 libvirtd[125601]: hostname: s390zl13
Jan 23 19:52:08 s390zl13 libvirtd[125601]: internal error: Cannot find start time for pid 224443
Jan 23 19:52:08 s390zl13 libvirtd[125601]: internal error: Cannot find start time for pid 226044
Jan 23 19:52:08 s390zl13 libvirtd[125601]: internal error: Cannot find start time for pid 228601
Jan 23 19:52:08 s390zl13 libvirtd[125601]: internal error: Cannot find start time for pid 228611
Jan 23 19:52:08 s390zl13 libvirtd[125601]: internal error: Cannot find start time for pid 228616
Jan 23 19:52:08 s390zl13 libvirtd[125601]: internal error: Cannot find start time for pid 228684
Jan 23 19:52:08 s390zl13 libvirtd[125601]: internal error: Cannot find start time for pid 238240
Jan 23 19:52:08 s390zl13 libvirtd[125601]: internal error: Cannot find start time for pid 265699
Jan 23 19:52:08 s390zl13 libvirtd[125601]: internal error: Cannot find start time for pid 265704
Jan 23 19:52:08 s390zl13 libvirtd[125601]: internal error: Cannot find start time for pid 265719
Jan 23 19:52:08 s390zl13 libvirtd[125601]: internal error: Cannot find start time for pid 277096
Jan 23 19:52:08 s390zl13 libvirtd[125601]: internal error: Cannot find start time for pid 277159
Jan 23 19:52:08 s390zl13 libvirtd[125601]: internal error: Cannot find start time for pid 277169
Jan 23 19:52:08 s390zl13 libvirtd[125601]: End of file while reading data: Input/output error
[…]
https://openqa.suse.de/tests/16557442 and https://openqa.suse.de/tests/16557443 pass bootloader_start/bootloader_zkvm again (one of them on worker32 -> confirming again that this is not an issue in openQA/worker code).
I will try to find all affected jobs and restart them so s390 can work on the backlog overnight (and maybe crash libvirtd again, let's see).
I'm only assigning myself until all jobs are restarted, but I don't plan to implement the mentioned error-reporting improvements and would rather create a (tools-infra) ticket about investigating the failing libvirtd on s390zl13.
Updated by nicksinger about 2 months ago
- Status changed from In Progress to Workable
- Assignee deleted (nicksinger)
I've updated the worker packages on worker33 again to the most recent version and just found ~39 jobs on OSD with
select j.id from jobs j join job_modules jm on j.id = jm.job_id where t_started >= '2025-01-20T00:00:00' and j.test not like '%:investigate:%' and j.clone_id is null and jm.name like 'bootloader_start' and j.result = 'timeout_exceeded';
which I restarted. I think most were restarted automatically already or were just stuck forever, building up a queue.
Marius' query in https://progress.opensuse.org/issues/176076#note-2 indicates this could affect other hosts as well. I've only fixed zl13, and only temporarily for now. Further investigation is needed.
Updated by mkittler about 2 months ago
- Status changed from Workable to In Progress
- Assignee set to mkittler
Updated by mkittler about 2 months ago · Edited
I checked workerconf.sls and it looks like we have 2 s390x hosts:
- s390zl12.oqa.prg2.suse.org: worker31:1-10, worker32:6-15
- s390zl13.oqa.prg2.suse.org: worker32:16-25, worker33:6-15
Judging by this query, both are technically affected, but only on s390zl13 was the number of timeouted jobs significant:
openqa=> select count(jobs.id) as timeouted_job_count, min(t_finished) as first_t_finished, max(t_finished) as last_t_finished, string_agg(DISTINCT concat(jobs.ARCH, '@', jobs.MACHINE), ', ') as job_type, string_agg(DISTINCT concat(workers.host, ':', workers.instance), ', ') as worker_name, jobs.assigned_worker_id as worker_id, (CASE WHEN string_agg(DISTINCT workers.host, '' ) = 'worker31' OR (string_agg(DISTINCT workers.host, '' ) = 'worker32' AND max(workers.instance) < 16) THEN 's390zl12' ELSE 's390zl13' END) as s390x_host from jobs join workers on workers.id = jobs.assigned_worker_id where t_finished >= '2025-01-22T00:00:00' and result = 'timeout_exceeded' group by jobs.assigned_worker_id order by count(jobs.id) desc;
timeouted_job_count | first_t_finished | last_t_finished | job_type | worker_name | worker_id | s390x_host
---------------------+---------------------+---------------------+-----------------------------+--------------------+-----------+------------
18 | 2025-01-22 00:54:58 | 2025-01-23 18:02:55 | s390x@s390x-kvm | worker33:10 | 2656 | s390zl13
18 | 2025-01-22 01:18:54 | 2025-01-23 18:33:05 | s390x@s390x-kvm | worker32:23 | 2629 | s390zl13
18 | 2025-01-22 00:48:40 | 2025-01-23 18:37:23 | s390x@s390x-kvm | worker32:20 | 2626 | s390zl13
18 | 2025-01-22 00:54:17 | 2025-01-23 18:29:59 | s390x@s390x-kvm | worker33:6 | 2652 | s390zl13
18 | 2025-01-22 01:33:55 | 2025-01-23 18:38:40 | s390x@s390x-kvm | worker32:22 | 2628 | s390zl13
18 | 2025-01-22 00:30:57 | 2025-01-23 18:12:18 | s390x@s390x-kvm | worker32:16 | 2622 | s390zl13
17 | 2025-01-22 01:54:32 | 2025-01-23 16:47:03 | s390x@s390x-kvm | worker33:14 | 2660 | s390zl13
17 | 2025-01-22 00:08:51 | 2025-01-23 17:29:05 | s390x@s390x-kvm | worker32:18 | 2624 | s390zl13
17 | 2025-01-22 00:40:25 | 2025-01-23 17:18:03 | s390x@s390x-kvm | worker32:19 | 2625 | s390zl13
16 | 2025-01-22 01:03:08 | 2025-01-23 17:41:56 | s390x@s390x-kvm | worker33:9 | 2655 | s390zl13
15 | 2025-01-22 06:45:07 | 2025-01-23 18:15:26 | s390x@s390x-kvm | worker33:12 | 2658 | s390zl13
15 | 2025-01-22 00:21:12 | 2025-01-23 15:27:35 | s390x@s390x-kvm | worker33:8 | 2654 | s390zl13
15 | 2025-01-22 03:02:01 | 2025-01-23 17:45:00 | s390x@s390x-kvm | worker32:21 | 2627 | s390zl13
14 | 2025-01-22 01:50:01 | 2025-01-23 19:11:51 | s390x@s390x-kvm | worker33:13 | 2659 | s390zl13
14 | 2025-01-22 00:28:29 | 2025-01-23 16:55:07 | s390x@s390x-kvm | worker32:24 | 2630 | s390zl13
13 | 2025-01-22 03:02:02 | 2025-01-23 14:04:41 | s390x@s390x-kvm | worker32:17 | 2623 | s390zl13
12 | 2025-01-22 03:56:07 | 2025-01-23 17:39:20 | s390x@s390x-kvm | worker33:7 | 2653 | s390zl13
12 | 2025-01-22 05:07:03 | 2025-01-23 07:20:31 | s390x@s390x-kvm | worker32:25 | 2631 | s390zl13
11 | 2025-01-22 03:02:08 | 2025-01-23 17:39:20 | s390x@s390x-kvm | worker33:11 | 2657 | s390zl13
10 | 2025-01-22 03:51:46 | 2025-01-23 07:22:01 | s390x@s390x-kvm | worker33:15 | 2661 | s390zl13
3 | 2025-01-22 02:32:59 | 2025-01-22 15:45:22 | s390x@s390x-kvm | worker32:8 | 3866 | s390zl12
2 | 2025-01-23 04:31:55 | 2025-01-23 10:04:17 | s390x@s390x-kvm | worker32:9 | 3868 | s390zl12
1 | 2025-01-22 12:13:04 | 2025-01-22 12:13:04 | x86_64@uefi-staging | worker32:60 | 3885 | s390zl13
1 | 2025-01-22 00:22:32 | 2025-01-22 00:22:32 | x86_64@64bit | worker34:57 | 3897 | s390zl13
1 | 2025-01-22 13:46:28 | 2025-01-22 13:46:28 | x86_64@64bit | worker34:63 | 3921 | s390zl13
1 | 2025-01-22 19:13:23 | 2025-01-22 19:13:23 | x86_64@64bit-4gbram | worker35:59 | 3946 | s390zl13
1 | 2025-01-22 03:58:33 | 2025-01-22 03:58:33 | aarch64@ipmi-64bit-unarmed | sapworker1:41 | 3957 | s390zl13
1 | 2025-01-24 01:21:01 | 2025-01-24 01:21:01 | x86_64@svirt-hyperv | qesapworker-prg4:2 | 2408 | s390zl13
1 | 2025-01-22 15:48:47 | 2025-01-22 15:48:47 | x86_64@64bit-ipmi-large-mem | worker36:63 | 3993 | s390zl13
1 | 2025-01-22 23:09:56 | 2025-01-22 23:09:56 | s390x@s390x-kvm | worker31:9 | 2575 | s390zl12
1 | 2025-01-22 07:26:02 | 2025-01-22 07:26:02 | s390x@s390x-kvm | worker32:11 | 2617 | s390zl12
1 | 2025-01-24 10:05:06 | 2025-01-24 10:05:06 | x86_64@64bit | worker34:27 | 2713 | s390zl13
1 | 2025-01-23 12:05:37 | 2025-01-23 12:05:37 | x86_64@64bit-4gbram | worker35:19 | 2745 | s390zl13
1 | 2025-01-24 10:07:02 | 2025-01-24 10:07:02 | x86_64@uefi | worker35:37 | 2763 | s390zl13
1 | 2025-01-23 09:43:28 | 2025-01-23 09:43:28 | x86_64@uefi | worker36:37 | 2803 | s390zl13
1 | 2025-01-22 01:43:01 | 2025-01-22 01:43:01 | aarch64@aarch64-virtio | worker-arm1:21 | 3027 | s390zl13
1 | 2025-01-22 02:25:27 | 2025-01-22 02:25:27 | aarch64@aarch64-virtio | worker-arm2:1 | 3047 | s390zl13
1 | 2025-01-22 02:46:59 | 2025-01-22 02:46:59 | aarch64@aarch64-virtio | worker-arm2:4 | 3050 | s390zl13
1 | 2025-01-22 02:14:32 | 2025-01-22 02:14:32 | aarch64@aarch64-virtio | worker-arm2:8 | 3054 | s390zl13
1 | 2025-01-22 02:14:57 | 2025-01-22 02:14:57 | aarch64@aarch64-virtio | worker-arm2:22 | 3068 | s390zl13
1 | 2025-01-23 15:10:56 | 2025-01-23 15:10:56 | x86_64@64bit | worker40:9 | 3095 | s390zl13
1 | 2025-01-24 09:05:03 | 2025-01-24 09:05:03 | x86_64@uefi | worker30:48 | 3233 | s390zl13
1 | 2025-01-23 16:10:55 | 2025-01-23 16:10:55 | x86_64@uefi | worker33:43 | 3334 | s390zl13
1 | 2025-01-24 10:18:53 | 2025-01-24 10:18:53 | x86_64@64bit | worker33:45 | 3339 | s390zl13
1 | 2025-01-23 12:37:28 | 2025-01-23 12:37:28 | x86_64@uefi | worker33:50 | 3433 | s390zl13
1 | 2025-01-23 19:11:31 | 2025-01-23 19:11:31 | x86_64@64bit-ipmi-large-mem | worker35:51 | 3514 | s390zl13
1 | 2025-01-22 01:14:22 | 2025-01-22 01:14:22 | x86_64@64bit | worker31:55 | 3561 | s390zl12
1 | 2025-01-22 00:32:39 | 2025-01-22 00:32:39 | x86_64@64bit | worker31:56 | 3562 | s390zl12
1 | 2025-01-23 19:08:06 | 2025-01-23 19:08:06 | s390x@s390x-kvm | worker32:6 | 3867 | s390zl12
1 | 2025-01-23 08:47:33 | 2025-01-23 08:47:33 | s390x@s390x-kvm | worker32:10 | 3869 | s390zl12
(50 rows)
Nick has already checked the "bad" host s390zl13, so I am now checking the "good" host s390zl12 for comparison.
I found the same g_hash_table_unref: assertion 'hash_table != NULL' failed warning, so it is probably a red herring, indeed. Restarting the service worked, but the service had been restarted 48 min ago anyway. There are no errors like internal error: Cannot find start time for pid … in the whole journal, so this might indeed be the culprit.
Updated by mkittler about 2 months ago
- Priority changed from Urgent to High
Not sure how to debug this IBM-specific device. Maybe we can consider it a hopefully non-repeating issue with that disk and only investigate the problem further if it happens again.
Updated by openqa_review about 2 months ago
- Due date set to 2025-02-08
Setting due date based on mean cycle time of SUSE QE Tools
Updated by okurz about 2 months ago
mkittler wrote in #note-8:
Not sure how to debug this IBM-specific device. Maybe we can consider it a hopefully non-repeating issue with that disk and only investigate the problem further if it happens again.
I agree. Let's assume the infra-related part is done and consider AC2 covered. Please focus on AC1, i.e. the dev-related improvement of ensuring that there is proper error feedback well before MAX_JOB_TIME.
Updated by mkittler about 2 months ago · Edited
About not running into a timeout sooner:
Maybe the command virsh freecell --all was itself hanging. Otherwise the SSH code in os-autoinst must be broken because it is already supposed to enforce a timeout of 5 minutes.
The test code looks like this:
record_info('free -h', $svirt->get_cmd_output('free -h'));
record_info('virsh freecell --all', $svirt->get_cmd_output('virsh freecell --all'));
record_info('virsh domstats', $svirt->get_cmd_output('virsh domstats'));
The first SSH command works (free -h) and it only fails on the first virsh … command. This speaks for the theory that virsh … itself is hanging and not the connection.
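To check that theory directly, a quick manual probe could look like the following sketch. It only loosely mirrors what os-autoinst does; the host, the root/agent authentication, the 30-second timeout and the command list are assumptions for illustration.
#!/usr/bin/perl
# Hedged sketch of a manual check: run the same two commands with a short
# Net::SSH2 timeout to see whether `virsh` itself stalls or the connection
# does. Host, user, timeout and commands are assumptions for illustration.
use strict;
use warnings;
use Net::SSH2;

my $host    = 's390zl13.oqa.prg2.suse.org';    # assumed target host
my $timeout = 30 * 1000;                       # 30 s in ms, far below the 5 min backend timeout

my $ssh2 = Net::SSH2->new(timeout => $timeout);
$ssh2->connect($host) or $ssh2->die_with_error;
$ssh2->auth_agent('root') or $ssh2->die_with_error;

for my $cmd ('free -h', 'virsh freecell --all') {
    my $chan = $ssh2->channel() or $ssh2->die_with_error;
    $chan->exec($cmd) or $ssh2->die_with_error;
    my $out = '';
    until ($chan->eof) {
        # If the command hangs without producing output, this read fails once
        # the 30 s timeout expires; a broken connection would fail earlier.
        if (my ($o, $e) = $chan->read2) { $out .= $o . $e }
        else                            { $ssh2->die_with_error }
    }
    printf "--- %s (exit %d) ---\n%s\n", $cmd, $chan->exit_status, $out;
    $chan->close;
}
If virsh is the culprit, the second iteration should abort with a timeout error while free -h completes normally.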
In the post fail hook it also runs a virsh … command, then complains that the SSH connection is broken and then (successfully) re-establishes it:
[2025-01-23T00:13:33.300023Z] [debug] [pid:118188] <<< backend::baseclass::run_ssh_cmd(cmd="virsh destroy openQA-SUT-9", wantarray=0, keep_open=1)
[2025-01-23T00:13:33.300049Z] [debug] [pid:118155] isotovideo handle_shutdown
[2025-01-23T00:13:33.300124Z] [debug] [pid:118188] <<< backend::baseclass::run_ssh(cmd="virsh destroy openQA-SUT-9", keep_open=1, wantarray=0)
[2025-01-23T00:13:33.300202Z] [debug] [pid:118155] stopping command server 118171 because test execution ended
[2025-01-23T00:13:33.300216Z] [debug] [pid:118188] <<< backend::baseclass::new_ssh_connection(keep_open=1, blocking=1, wantarray=0)
[2025-01-23T00:13:33.300289Z] [debug] [pid:118155] isotovideo: informing websocket clients before stopping command server: http://127.0.0.1:20143/oNGCAPETeF0q9hN6/broadcast
[2025-01-23T00:13:33.300356Z] [debug] [pid:118188] Closing broken SSH connection (key:hostname=s390zl13.oqa.prg2.suse.org,username=root,port=22)
…
[2025-01-23T00:13:33.344307Z] [debug] [pid:118188] SSH connection to root@s390zl13.oqa.prg2.suse.org established
The "Closing broken SSH connection" means we cannot create a new channel via the SSH connection. So it cannot be reused and will be re-established which is actually even happening. So maybe it was the SSH connection after all?
Updated by mkittler about 2 months ago · Edited
I created a small test script which simulates the behavior of the SSH code in os-autoinst in case a command takes very long to produce any output:
#!/bin/perl
use Mojo::Base -strict, -signatures;
use Net::SSH2;

my $ssh2 = Net::SSH2->new(timeout => 5 * 1000);
$ssh2->connect('localhost') or $ssh2->die_with_error;
$ssh2->check_hostkey('ask') or $ssh2->die_with_error;
$ssh2->auth_agent($ENV{USER}) or $ssh2->die_with_error;
my $chan = $ssh2->channel() or $ssh2->die_with_error;
print "BEFORE EXEC\n";
#$ssh2->blocking(0);
$chan->exec('sleep 10; ls') or $ssh2->die_with_error;
print "AFTER EXEC\n";
$chan->send_eof;
#print while <$chan>;
print "AFTER SEND_EOF\n";
my ($stdout, $stderr) = ('', '');
until ($chan->eof) {
    if (my ($o, $e) = $chan->read2) {
        $stdout .= $o;
        $stderr .= $e;
    }
    else {
        $ssh2->die_with_error;
    }
}
print "AFTER UNTIL\n";
print "EXIT CODE: ", $chan->exit_status, "\n";
print "STDOUT: $stdout\n";
print "STDERR: $stderr\n";
$chan->close;
This works after adding the else block in the until-loop, in the sense that the script then returns in under 10 seconds despite the sleep. So maybe it makes sense to add this kind of error handling in os-autoinst as well. Note that exec itself always returns fast (presumably as fast as the command can be sent over the network); it is not waiting for the command to complete (and thus not potentially running into a timeout due to a long-running command).
This of course means our Net::SSH2 timeout will apply to individual reads. So if a command continuously produces output, the code will still not run into a timeout. However, I suppose that's also what we want. I still hope this doesn't break any use cases, as 5 minutes is not that long (e.g. a copy operation of a large file over the network might take longer but not produce any output while it is ongoing).
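To make that distinction concrete, here is a small sketch, not the merged os-autoinst change; the helper name and the overall_timeout option are illustrative assumptions. It shows a read loop where the Net::SSH2 timeout bounds each individual read2 call, with an optional overall deadline on top that would additionally cap commands that keep producing output.
#!/usr/bin/perl
# Sketch only: per-read timeout (what Net::SSH2's `timeout` option gives us)
# versus an optional overall deadline. Helper name and option are illustrative.
use strict;
use warnings;
use Time::HiRes qw(time);

sub read_all_output {
    my ($ssh2, $chan, %opt) = @_;
    my $deadline = $opt{overall_timeout} ? time + $opt{overall_timeout} : undef;
    my ($stdout, $stderr) = ('', '');
    until ($chan->eof) {
        # Each read2 call is bounded by the Net::SSH2 timeout: a silently
        # hanging command fails here, a chatty long-runner never does.
        if (my ($o, $e) = $chan->read2) {
            $stdout .= $o;
            $stderr .= $e;
        }
        else {
            $ssh2->die_with_error;
        }
        # Only an overall deadline would also cap chatty long-running commands.
        die "overall timeout exceeded\n" if $deadline && time > $deadline;
    }
    return ($stdout, $stderr);
}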
Updated by mkittler about 2 months ago
- Status changed from In Progress to Feedback
Updated by mkittler about 2 months ago
The PR has been merged and I also put the s390x worker I used for testing back into production. (And I created https://gitlab.suse.de/openqa/salt-states-openqa/-/merge_requests/1357 to make this easier in the future.)
Updated by okurz about 2 months ago
- Due date deleted (2025-02-08)
- Status changed from Feedback to Resolved
Tests in production look good. As decided in the daily we are good to resolve.
Updated by mkittler about 2 months ago
- Status changed from In Progress to Feedback
This PR should fix the error handling: https://github.com/os-autoinst/os-autoinst/pull/2631
Updated by okurz about 2 months ago
- Status changed from Feedback to In Progress
Updated by mkittler about 2 months ago
- Subject changed from [tools] tests run into MAX_JOB_TIME exceeded in bootloader_zkvm, unable to login over ssh? to [tools] tests run into MAX_JOB_TIME exceeded in bootloader_zkvm, unable to login over ssh? size:M
Updated by mkittler about 2 months ago
Another fixup PR: https://github.com/os-autoinst/os-autoinst/pull/2635
Updated by mkittler about 2 months ago
- Status changed from In Progress to Feedback
Updated by mkittler about 1 month ago
The latest version was deployed this morning but no new tests have been scheduled, so I created https://openqa.suse.de/tests/16624497 (openqa-clone-job --skip-download --within-instance https://openqa.suse.de/tests/16620303 _GROUP=0 {BUILD,TEST}+=-test-poo176076 PUBLISH_HDD_1= PUBLISH_PFLASH_VARS=).
Updated by mkittler about 1 month ago
- Status changed from Feedback to Resolved
The test succeeded. With that the ACs are fulfilled so I am considering this ticket resolved.
Updated by nicksinger about 1 month ago
- Related to action #176868: Random SSH failures on s390x workers size:S added
Updated by tinita about 1 month ago
We are seeing the error handling now in some tests: #176868
https://openqa.suse.de/tests/16701073/logfile?filename=autoinst-log.txt#line-922
[2025-02-08T01:51:13.557920Z] [debug] [pid:126055] [run_ssh_cmd(virsh undefine --snapshots-metadata openQA-SUT-8)] stderr:
error: failed to get domain 'openQA-SUT-8'
[2025-02-08T01:51:13.558019Z] [debug] [pid:126055] [run_ssh_cmd(virsh undefine --snapshots-metadata openQA-SUT-8)] exit-code: 1