action #170209
coordination #102906 (closed): [saga][epic] Increased stability of tests with less "known failures", known incompletes handled automatically within openQA
coordination #175515: [epic] incomplete jobs with "Failed to find an available port: Address already in use"
[sporadic] auto_review:"Failed to find an available port: Address already in use":retry, produces incomplete jobs on OSD, multiple machines size:M
Description
Observation
I received an alert mail about an incomplete job: https://openqa.suse.de/tests/15996418
It fails with:
[2024-11-25T00:19:57.525877Z] [warn] [pid:103270] !!! : qemu-system-x86_64: -vnc :102,share=force-shared: Failed to find an available port: Address already in use
I asked in Slack and @tinita had been observing the same for "a few hours" - apparently all on worker39
https://openqa.suse.de/admin/workers/2898 shows that this started about 12 hours ago with this job: https://openqa.suse.de/tests/15995445
The qemu-system-x86_64 process 1963 has been running for 12 hours with -vnc :102,share=force-shared.
Acceptance criteria
- AC1: Affected jobs are restarted automatically
- AC2: We have a better understanding of situations where this can happen (if at all)
Suggestion
- Check one more time for bugs – also consider testing (!) – in the code for handling leftover QEMU processes
- Check one more time for bugs – also consider testing (!) – in terminating/killing the process group of isotovideo (in Mojo::…::ReadWriteProcess)
- Add/enable debug logging when starting/stopping isotovideo (maybe on ReadWriteProcess level)
- Consider starting/stopping isotovideo in a process group with low-level Perl code to replicate the error, investigate it, and potentially replace the problematic Mojo::ReadWriteProcess (see the sketch below)
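A minimal low-level Perl sketch of the last suggestion (illustration only, not actual os-autoinst or Mojo::ReadWriteProcess code; the exec'd command and the grace period are placeholders):
use strict;
use warnings;

# Start the child in its own process group so that the whole tree
# (e.g. isotovideo plus the QEMU it spawns) can be signalled at once.
my $pid = fork() // die "fork failed: $!";
if ($pid == 0) {
    setpgrp(0, 0);    # the child becomes leader of a new process group
    exec 'isotovideo', '-d' or die "exec failed: $!";    # placeholder command
}

# ... later, to stop the whole group (negative PID = process group):
kill 'TERM', -$pid or warn "TERM to group $pid failed: $!";
sleep 5;    # placeholder grace period before escalating
kill 'KILL', -$pid if kill 0, -$pid;    # force-kill whatever is still alive
waitpid $pid, 0;    # reap the direct child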
Updated by tinita 3 months ago
- Priority changed from Urgent to High
% ps aux | grep qemu
...
_openqa+ 1963 98.4 0.4 7061392 2206548 ? Sl Nov24 750:16 /usr/bin/qemu-system-x86_64 -device VGA,edid=on,xres=1024,yres=768 -chardev ringbuf,id=serial0,logfile=serial0,logappend=on -serial chardev:serial0 -audiodev none,id=snd0 -device intel-hda -device hda-output,audiodev=snd0 -global isa-fdc.fdtypeA=none -m 2048 -cpu qemu64 -netdev user,id=qanet0,hostfwd=tcp::39102-:39102 -device virtio-net,netdev=qanet0,mac=52:54:00:12:34:56 -object rng-random,filename=/dev/urandom,id=rng0 -device virtio-rng-pci,rng=rng0 -device qemu-xhci -device usb-tablet -smp 1 -enable-kvm -no-shutdown -vnc :102,share=force-shared -device virtio-serial -chardev pipe,id=virtio_console,path=virtio_console,logfile=virtio_console.log,logappend=on -device virtconsole,chardev=virtio_console,name=org.openqa.console.virtio_console -chardev pipe,id=virtio_console_user,path=virtio_console_user,logfile=virtio_console_user.log,logappend=on -device virtconsole,chardev=virtio_console_user,name=org.openqa.console.virtio_console_user -chardev socket,path=qmp_socket,server=on,wait=off,id=qmp_socket,logfile=qmp_socket.log,logappend=on -qmp chardev:qmp_socket -S -device virtio-scsi-pci,id=scsi0 -blockdev driver=file,node-name=hd0-file,filename=/var/lib/openqa/pool/12/raid/hd0,cache.no-flush=on -blockdev driver=qcow2,node-name=hd0,file=hd0-file,cache.no-flush=on,discard=unmap -device virtio-blk,id=hd0-device,drive=hd0,serial=hd0 -blockdev driver=file,node-name=cd0-overlay0-file,filename=/var/lib/openqa/pool/12/raid/cd0-overlay0,cache.no-flush=on -blockdev driver=qcow2,node-name=cd0-overlay0,file=cd0-overlay0-file,cache.no-flush=on,discard=unmap -device scsi-cd,id=cd0-device,drive=cd0-overlay0,serial=cd0 -drive id=pflash-code-overlay0,if=pflash,file=/var/lib/openqa/pool/12/raid/pflash-code-overlay0,unit=0,readonly=on -drive id=pflash-vars-overlay0,if=pflash,file=/var/lib/openqa/pool/12/raid/pflash-vars-overlay0,unit=1
% date
Mon Nov 25 11:10:00 AM UTC 2024
The process was started at 22:28, so it must belong to this test:
https://openqa.suse.de/tests/15995445/logfile?filename=autoinst-log.txt#line-201
I stopped the worker instance for now:
systemctl stop openqa-worker-auto-restart@12.service
So I think the urgency can be reduced.
Updated by jbaier_cz 3 months ago
- Related to action #166025: qemu process hanging indefinitely blocking the corresponding VNC port size:S added
Updated by mkittler 3 months ago
- Subject changed from "Failed to find an available port: Address already in use", produces incomplete jobs on OSD - most likely related to worker39 to "Failed to find an available port: Address already in use", produces incomplete jobs on OSD - most likely related to worker39 size:M
- Description updated (diff)
- Status changed from New to Workable
Updated by nicksinger 3 months ago
There was another occurrence of this today on worker39, reported by Martin via Slack. The test failed similarly to the previous report:
[2024-12-10T00:49:11.761843Z] [debug] [pid:121552] starting: /usr/bin/qemu-system-x86_64 -device VGA,edid=on,xres=1024,yres=768 -only-migratable -chardev ringbuf,id=serial0,logfile=serial0,logappend=on -serial chardev:serial0 -audiodev none,id=snd0 -device intel-hda -device hda-output,audiodev=snd0 -global isa-fdc.fdtypeA=none -m 1024 -cpu qemu64 -netdev user,id=qanet0 -device virtio-net,netdev=qanet0,mac=52:54:00:12:34:56 -object rng-random,filename=/dev/urandom,id=rng0 -device virtio-rng-pci,rng=rng0 -boot order=c -device qemu-xhci -device usb-tablet -smp 16,threads=2 -enable-kvm -no-shutdown -vnc :103,share=force-shared -device virtio-serial -chardev pipe,id=virtio_console,path=virtio_console,logfile=virtio_console.log,logappend=on -device virtconsole,chardev=virtio_console,name=org.openqa.console.virtio_console -chardev pipe,id=virtio_console_user,path=virtio_console_user,logfile=virtio_console_user.log,logappend=on -device virtconsole,chardev=virtio_console_user,name=org.openqa.console.virtio_console_user -chardev socket,path=qmp_socket,server=on,wait=off,id=qmp_socket,logfile=qmp_socket.log,logappend=on -qmp chardev:qmp_socket -S -device virtio-scsi-pci,id=scsi0 -blockdev driver=file,node-name=hd0-overlay0-file,filename=/var/lib/openqa/pool/13/raid/hd0-overlay0,cache.no-flush=on -blockdev driver=qcow2,node-name=hd0-overlay0,file=hd0-overlay0-file,cache.no-flush=on,discard=unmap -device virtio-blk,id=hd0-device,drive=hd0-overlay0,bootindex=0,serial=hd0
[2024-12-10T00:49:11.768614Z] [debug] [pid:121552] Waiting for 0 attempts
dmesg: read kernel buffer failed: Operation not permitted
[2024-12-10T00:49:12.073274Z] [debug] [pid:121552] Waiting for 1 attempts
[2024-12-10T00:49:12.073470Z] [info] [pid:121552] ::: backend::baseclass::die_handler: Backend process died, backend errors are reported below in the following lines:
QEMU terminated before QMP connection could be established. Check for errors below
[2024-12-10T00:49:12.073805Z] [info] [pid:121552] ::: OpenQA::Qemu::Proc::save_state: Saving QEMU state to qemu_state.json
[2024-12-10T00:49:12.074785Z] [debug] [pid:121552] Passing remaining frames to the video encoder
[2024-12-10T00:49:12.078140Z] [debug] [pid:121552] Waiting for video encoder to finalize the video
[2024-12-10T00:49:12.078216Z] [debug] [pid:121552] The built-in video encoder (pid 121742) terminated
[2024-12-10T00:49:12.079343Z] [debug] [pid:121552] QEMU: QEMU emulator version 8.2.7 (SUSE Linux Enterprise 15)
[2024-12-10T00:49:12.079413Z] [debug] [pid:121552] QEMU: Copyright (c) 2003-2023 Fabrice Bellard and the QEMU Project developers
[2024-12-10T00:49:12.079491Z] [warn] [pid:121552] !!! : qemu-system-x86_64: warning: This family of AMD CPU doesn't support hyperthreading(2)
[2024-12-10T00:49:12.079543Z] [debug] [pid:121552] QEMU: Please configure -smp options properly or try enabling topoext feature.
[2024-12-10T00:49:12.079582Z] [warn] [pid:121552] !!! : qemu-system-x86_64: -vnc :103,share=force-shared: Failed to find an available port: Address already in use
[2024-12-10T00:49:12.080349Z] [debug] [pid:121552] sending magic and exit
[2024-12-10T00:49:12.080650Z] [debug] [pid:120871] received magic close
[2024-12-10T00:49:12.089917Z] [debug] [pid:120871] backend process exited: 0
[2024-12-10T00:49:12.190817Z] [warn] [pid:120871] !!! main: failed to start VM at /usr/lib/os-autoinst/backend/driver.pm line 104.
backend::driver::start_vm(backend::driver=HASH(0x55b9f6c2f468)) called at /usr/lib/os-autoinst/OpenQA/Isotovideo/Backend.pm line 18
OpenQA::Isotovideo::Backend::new("OpenQA::Isotovideo::Backend") called at /usr/lib/os-autoinst/OpenQA/Isotovideo/Runner.pm line 109
OpenQA::Isotovideo::Runner::create_backend(OpenQA::Isotovideo::Runner=HASH(0x55b9edbfa680)) called at /usr/lib/os-autoinst/OpenQA/Isotovideo/Runner.pm line 251
OpenQA::Isotovideo::Runner::init(OpenQA::Isotovideo::Runner=HASH(0x55b9edbfa680)) called at /usr/bin/isotovideo line 183
eval {...} called at /usr/bin/isotovideo line 178
[2024-12-10T00:49:12.191180Z] [debug] [pid:120871] stopping command server 121330 because test execution ended through exception
[2024-12-10T00:49:12.391590Z] [debug] [pid:120871] done with command server
[2024-12-10T00:49:12.391689Z] [debug] [pid:120871] stopping autotest process 121362
[2024-12-10T00:49:12.592074Z] [debug] [pid:120871] done with autotest process
120871: EXIT 1
[2024-12-10T00:49:12.660935Z] [info] [pid:4058] Isotovideo exit status: 1
[2024-12-10T00:49:12.693330Z] [info] [pid:4058] +++ worker notes +++
[2024-12-10T00:49:12.693534Z] [info] [pid:4058] End time: 2024-12-10 00:49:12
[2024-12-10T00:49:12.693648Z] [info] [pid:4058] Result: died
[2024-12-10T00:49:12.701384Z] [info] [pid:121845] Uploading autoinst-log.txt
There are some more details in the Slack thread, but I didn't follow it closely and don't completely understand what the "cause" of this was.
Updated by gpathak 3 months ago · Edited
As @tinita suggested, I tried to find the job that actually started the leftover qemu process.
I found that on worker39:13 the job https://openqa.suse.de/tests/16150344/ was cancelled, and the logs (https://openqa.suse.de/tests/16150344/logfile?filename=autoinst-log.txt#line-8085) show that QEMU did not get signal 15 from isotovideo.
So this was the first job on worker39:13 which did not close the QEMU session gracefully, causing the "Address already in use" error on further retries:
- https://openqa.suse.de/tests/16150572#line-217
- https://openqa.suse.de/tests/16150578#line-158
- https://openqa.suse.de/tests/16150595#line-175
- https://openqa.suse.de/tests/16150618#line-151
Marius mentioned that we already have an open ticket for the scenario when QEMU sometimes doesn't exit gracefully.
Maybe this is also related: #136238
Updated by tinita 3 months ago
I found that both jobs that caused a hanging qemu process have this in common, with a tiny difference:
https://openqa.suse.de/tests/16150344/logfile?filename=autoinst-log.txt#line-8091
Fatal glibc error: malloc.c:4189 (_int_malloc): assertion failed: chunk_main_arena (fwd)
https://openqa.suse.de/tests/15995445/logfile?filename=autoinst-log.txt#line-5639
Fatal glibc error: malloc.c:4172 (_int_malloc): assertion failed: chunk_main_arena (bck->bk)
In both cases we have "::: backend::driver::_collect_orphan: Driver backend collected unknown process" after that, but that can't be the qemu process, because we know it was still running, causing the actual problem.
Updated by nicksinger 2 months ago
This time we have something on openqaworker18:
- https://openqa.suse.de/tests/16218776 <- last good job
- https://openqa.suse.de/tests/16219186 <- first incomplete but without a "Failed to find an available port"-message
- https://openqa.suse.de/tests/16219574 <- first incomplete with this message
What I noticed while trying to understand why 16219186 incompleted:
[2024-12-17T19:00:35.155667Z] [info] [pid:38979] Preparing cgroup to start isotovideo
[2024-12-17T19:00:35.156204Z] [warn] [pid:38979] Disabling cgroup usage because cgroup creation failed: mkdir /sys/fs/cgroup/systemd: Permission denied at /usr/lib/perl5/vendor_perl/5.26.1/Mojo/File.pm line 85.
[2024-12-17T19:00:35.156268Z] [info] [pid:38979] You can define a custom slice with OPENQA_CGROUP_SLICE or indicating the base mount with MOJO_CGROUP_FS.
but I am not sure whether this is related.
Updated by nicksinger 2 months ago
Another broken instance on worker18: https://openqa.suse.de/tests/16236652 - I might reboot worker18 if nobody wants to look into it right now (asked in Slack)
Updated by okurz 2 months ago
- Subject changed from "Failed to find an available port: Address already in use", produces incomplete jobs on OSD - most likely related to worker39 size:M to [sporadic] auto_review:"Failed to find an available port: Address already in use":retry, produces incomplete jobs on OSD, multiple machines size:M
- Priority changed from Normal to High
Also happened in https://gitlab.suse.de/openqa/scripts-ci/-/jobs/3573426#L52 and there is no automatic retrying yet, so we should handle this with higher priority. I also added an "auto_review" expression.
Updated by livdywan about 2 months ago
- Status changed from Workable to In Progress
- Assignee set to livdywan
I'm investigating this at the moment.
Updated by openqa_review about 2 months ago
- Due date set to 2025-01-21
Setting due date based on mean cycle time of SUSE QE Tools
Updated by tinita about 2 months ago · Edited
To search for recent failures like this you can use:
select concat('https://openqa.suse.de/t', id), id, assigned_worker_id, t_started, state, result, test, reason from jobs where reason = 'backend died: QEMU exited unexpectedly, see log for details' order by t_started desc limit 20;
It can theoretically be a different error, so the log still needs to be checked.
And we should kill the qemu process on that worker, so that the next job on this worker instance won't run into the same issue.
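To locate such a leftover qemu process for a given worker slot, a small Perl helper along these lines could be used (only a sketch; it assumes the convention visible in the logs above that worker instance N uses VNC display 90+N, e.g. :102 on instance 12 and :103 on instance 13):
use strict;
use warnings;

my $instance = shift // die "usage: $0 <worker-instance>\n";
my $display  = 90 + $instance;    # e.g. instance 12 -> -vnc :102 (see logs above)
for my $dir (glob '/proc/[0-9]*') {
    open my $fh, '<', "$dir/cmdline" or next;
    my $cmdline = do { local $/; <$fh> };
    close $fh;
    next unless defined $cmdline && length $cmdline;
    $cmdline =~ tr/\0/ /;    # arguments are NUL-separated in /proc/<pid>/cmdline
    next unless $cmdline =~ /qemu-system/ && $cmdline =~ /-vnc :$display,/;
    my ($pid) = $dir =~ m{(\d+)$};
    print "possible leftover qemu process: pid $pid\n  $cmdline\n";
}
The reported PID can then be terminated with a plain kill.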
Updated by livdywan about 2 months ago
- Status changed from In Progress to Workable
- Assignee deleted (livdywan)
I'm afraid there is something more important on my mind.
Updated by mkittler about 2 months ago
- Status changed from Workable to In Progress
- Assignee set to mkittler
Updated by mkittler about 2 months ago · Edited
We saw this again on multiple slots. Example: https://openqa.suse.de/admin/workers/3377
Notes from my investigation:
Normally the worker slot should at least show up as broken but here it doesn't (it is idling).
systemctl shows the leftover process quite clearly:
openqa-worker-auto-restart@23.service - openQA Worker #23
Loaded: loaded (/usr/lib/systemd/system/openqa-worker-auto-restart@.service; enabled; preset: disabled)
Drop-In: /etc/systemd/system/openqa-worker-auto-restart@.service.d
└─20-nvme-autoformat.conf, 30-openqa-max-inactive-caching-downloads.conf
Active: active (running) since Sun 2025-01-12 05:18:31 UTC; 1 day 9h ago
Process: 15228 ExecStartPre=/usr/bin/install -d -m 0755 -o _openqa-worker /var/lib/openqa/pool/23 (code=exited, status=0/SUCCESS)
Main PID: 15237 (worker)
Tasks: 7 (limit: 19660)
CPU: 7h 31min 49.917s
CGroup: /openqa.slice/openqa-worker.slice/openqa-worker-auto-restart@23.service
├─15237 /usr/bin/perl /usr/share/openqa/script/worker --instance 23
└─69998 /usr/bin/qemu-system-x86_64 -device virtio-vga,edid=on,xres=1024,yres=768 -only-migratable -chardev ringbuf,id=serial0,logfile=serial0,logappend=on -serial chardev:serial0 -audiodev none,id=snd0 -devi>
The leftover process can just be terminated, e.g. sudo kill 69998.
Link to job that caused the leftover process: https://openqa.suse.de/tests/16427146
Relevant worker logs:
Jan 13 14:07:12 qesapworker-prg5 worker[73422]: [info] [pid:73422] Uploading serial_terminal.txt
Jan 13 14:07:12 qesapworker-prg5 worker[73422]: [debug] [pid:73422] Uploading artefact serial_terminal.txt
Jan 13 14:07:12 qesapworker-prg5 worker[73422]: [info] [pid:73422] Uploading serial_terminal_user.txt
Jan 13 14:07:12 qesapworker-prg5 worker[73422]: [debug] [pid:73422] Uploading artefact serial_terminal_user.txt
Jan 13 14:07:12 qesapworker-prg5 worker[15237]: [debug] [pid:15237] Job 16427146 stopped as incomplete
Jan 13 14:07:12 qesapworker-prg5 worker[15237]: [debug] [pid:15237] Unable to read result-check_network.json: Can't open file "/var/lib/openqa/pool/23/testresults/result-check_network.json": No such file or directory at />
Jan 13 14:07:12 qesapworker-prg5 worker[15237]: Mojo::File::slurp(Mojo::File=SCALAR(0x562378437cd8)) called at /usr/share/openqa/script/../lib/OpenQA/Worker/Job.pm line 1135
Jan 13 14:07:12 qesapworker-prg5 worker[15237]: eval {...} called at /usr/share/openqa/script/../lib/OpenQA/Worker/Job.pm line 1135
Jan 13 14:07:12 qesapworker-prg5 worker[15237]: OpenQA::Worker::Job::_read_json_file(OpenQA::Worker::Job=HASH(0x562379180a18), "result-check_network.json") called at /usr/share/openqa/script/../lib/OpenQA/Worker/J>
Jan 13 14:07:12 qesapworker-prg5 worker[15237]: OpenQA::Worker::Job::_read_module_result(OpenQA::Worker::Job=HASH(0x562379180a18), "check_network") called at /usr/share/openqa/script/../lib/OpenQA/Worker/Job.pm li>
Jan 13 14:07:12 qesapworker-prg5 worker[15237]: OpenQA::Worker::Job::_read_result_file(OpenQA::Worker::Job=HASH(0x562379180a18), "", ARRAY(0x5623791c2740)) called at /usr/share/openqa/script/../lib/OpenQA/Worker/J>
Jan 13 14:07:12 qesapworker-prg5 worker[15237]: OpenQA::Worker::Job::_upload_results_step_0_prepare(OpenQA::Worker::Job=HASH(0x562379180a18), CODE(0x5623783eb540)) called at /usr/share/openqa/script/../lib/OpenQA/>
Jan 13 14:07:12 qesapworker-prg5 worker[15237]: OpenQA::Worker::Job::_upload_results(OpenQA::Worker::Job=HASH(0x562379180a18), CODE(0x5623783eb540)) called at /usr/share/openqa/script/../lib/OpenQA/Worker/Job.pm l>
Jan 13 14:07:12 qesapworker-prg5 worker[15237]: OpenQA::Worker::Job::_stop_step_5_2_upload(OpenQA::Worker::Job=HASH(0x562379180a18), "died", CODE(0x56237930eda8)) called at /usr/share/openqa/script/../lib/OpenQA/W>
Jan 13 14:07:12 qesapworker-prg5 worker[15237]: OpenQA::Worker::Job::__ANON__() called at /usr/share/openqa/script/../lib/OpenQA/Worker/Job.pm line 120
Jan 13 14:07:12 qesapworker-prg5 worker[15237]: OpenQA::Worker::Job::_invoke_after_result_upload(OpenQA::Worker::Job=HASH(0x562379180a18), CODE(0x562379312558)) called at /usr/share/openqa/script/../lib/OpenQA/Wor>
Jan 13 14:07:12 qesapworker-prg5 worker[15237]: OpenQA::Worker::Job::_stop_step_5_1_upload(OpenQA::Worker::Job=HASH(0x562379180a18), "died", CODE(0x56237930eda8)) called at /usr/share/openqa/script/../lib/OpenQA/W>
Jan 13 14:07:12 qesapworker-prg5 worker[15237]: OpenQA::Worker::Job::__ANON__(Mojo::IOLoop::Subprocess=HASH(0x56237930b258), "", "died") called at /usr/lib/perl5/vendor_perl/5.26.1/Mojo/IOLoop/Subprocess.pm line 85
Jan 13 14:07:12 qesapworker-prg5 worker[15237]: Mojo::IOLoop::Subprocess::__ANON__(Mojo::IOLoop::Stream=HASH(0x56237930f270)) called at /usr/lib/perl5/vendor_perl/5.26.1/Mojo/EventEmitter.pm line 15
Jan 13 14:07:12 qesapworker-prg5 worker[15237]: Mojo::EventEmitter::emit(Mojo::IOLoop::Stream=HASH(0x56237930f270), "close") called at /usr/lib/perl5/vendor_perl/5.26.1/Mojo/IOLoop/Stream.pm line 27
Jan 13 14:07:12 qesapworker-prg5 worker[15237]: Mojo::IOLoop::Stream::close(Mojo::IOLoop::Stream=HASH(0x56237930f270)) called at /usr/lib/perl5/vendor_perl/5.26.1/Mojo/IOLoop/Stream.pm line 109
Jan 13 14:07:12 qesapworker-prg5 worker[15237]: Mojo::IOLoop::Stream::_read(Mojo::IOLoop::Stream=HASH(0x56237930f270)) called at /usr/lib/perl5/vendor_perl/5.26.1/Mojo/IOLoop/Stream.pm line 57
Jan 13 14:07:12 qesapworker-prg5 worker[15237]: Mojo::IOLoop::Stream::__ANON__(Mojo::Reactor::Poll=HASH(0x56237222f690)) called at /usr/lib/perl5/vendor_perl/5.26.1/Mojo/Reactor/Poll.pm line 141
Jan 13 14:07:12 qesapworker-prg5 worker[15237]: eval {...} called at /usr/lib/perl5/vendor_perl/5.26.1/Mojo/Reactor/Poll.pm line 141
Jan 13 14:07:12 qesapworker-prg5 worker[15237]: Mojo::Reactor::Poll::_try(Mojo::Reactor::Poll=HASH(0x56237222f690), "I/O watcher", CODE(0x56237930d178), 0) called at /usr/lib/perl5/vendor_perl/5.26.1/Mojo/Reactor/>
Jan 13 14:07:12 qesapworker-prg5 worker[15237]: Mojo::Reactor::Poll::one_tick(Mojo::Reactor::Poll=HASH(0x56237222f690)) called at /usr/lib/perl5/vendor_perl/5.26.1/Mojo/Reactor/Poll.pm line 101
Jan 13 14:07:12 qesapworker-prg5 worker[15237]: Mojo::Reactor::Poll::start(Mojo::Reactor::Poll=HASH(0x56237222f690)) called at /usr/lib/perl5/vendor_perl/5.26.1/Mojo/IOLoop.pm line 134
Jan 13 14:07:12 qesapworker-prg5 worker[15237]: Mojo::IOLoop::start(Mojo::IOLoop=HASH(0x562373b66ac8)) called at /usr/share/openqa/script/../lib/OpenQA/Worker.pm line 369
Jan 13 14:07:12 qesapworker-prg5 worker[15237]: OpenQA::Worker::exec(OpenQA::Worker=HASH(0x562375602398)) called at /usr/share/openqa/script/worker line 120
Jan 13 14:07:12 qesapworker-prg5 worker[15237]: [debug] [pid:15237] REST-API call: POST "http://openqa.suse.de/api/v1/jobs/16427146/status"
Jan 13 14:07:12 qesapworker-prg5 worker[73425]: [debug] [pid:73425] Optimizing /var/lib/openqa/pool/23/testresults/glibc_locale-8.png
Jan 13 14:07:12 qesapworker-prg5 worker[73425]: [debug] [pid:73425] Uploading artefact glibc_locale-8.png as c62e2548b2e63f5b68ffe0f91e6bc96b
Jan 13 14:07:12 qesapworker-prg5 worker[73425]: [debug] [pid:73425] Optimizing /var/lib/openqa/pool/23/testresults/.thumbs/glibc_locale-8.png
Jan 13 14:07:12 qesapworker-prg5 worker[73425]: [debug] [pid:73425] Uploading artefact glibc_locale-8.png as c62e2548b2e63f5b68ffe0f91e6bc96b
Jan 13 14:07:12 qesapworker-prg5 worker[73425]: [debug] [pid:73425] Optimizing /var/lib/openqa/pool/23/testresults/glibc_locale-1.png
Jan 13 14:07:12 qesapworker-prg5 worker[73425]: [debug] [pid:73425] Uploading artefact glibc_locale-1.png as d804d6cb366b2158b5210b343401c366
Jan 13 14:07:12 qesapworker-prg5 worker[73425]: [debug] [pid:73425] Optimizing /var/lib/openqa/pool/23/testresults/.thumbs/glibc_locale-1.png
Jan 13 14:07:12 qesapworker-prg5 worker[73425]: [debug] [pid:73425] Uploading artefact glibc_locale-1.png as d804d6cb366b2158b5210b343401c366
…
Jan 13 14:07:13 qesapworker-prg5 worker[73425]: [debug] [pid:73425] Optimizing /var/lib/openqa/pool/23/testresults/.thumbs/glibc_locale-3.png
Jan 13 14:07:13 qesapworker-prg5 worker[73425]: [debug] [pid:73425] Uploading artefact glibc_locale-3.png as 9fb8d6a2417669f089c1ce447b73e1ff
Jan 13 14:07:13 qesapworker-prg5 worker[73425]: [debug] [pid:73425] Uploading artefact glibc_locale-9.txt
Jan 13 14:07:13 qesapworker-prg5 worker[73425]: [debug] [pid:73425] Uploading artefact glibc_locale-21.txt
Jan 13 14:07:13 qesapworker-prg5 worker[73425]: [debug] [pid:73425] Uploading artefact glibc_locale-20.txt
Jan 13 14:07:13 qesapworker-prg5 worker[73425]: [debug] [pid:73425] Uploading artefact glibc_locale-25.txt
Jan 13 14:07:13 qesapworker-prg5 worker[73425]: [debug] [pid:73425] Uploading artefact glibc_locale-24.txt
Jan 13 14:07:13 qesapworker-prg5 worker[73425]: [debug] [pid:73425] Uploading artefact glibc_locale-23.txt
Jan 13 14:07:13 qesapworker-prg5 worker[73425]: [debug] [pid:73425] Uploading artefact glibc_locale-7.txt
Jan 13 14:07:13 qesapworker-prg5 worker[73425]: [debug] [pid:73425] Uploading artefact glibc_locale-27.txt
Jan 13 14:07:13 qesapworker-prg5 worker[73425]: [debug] [pid:73425] Uploading artefact glibc_locale-18.txt
Jan 13 14:07:13 qesapworker-prg5 worker[73425]: [debug] [pid:73425] Uploading artefact glibc_locale-16.txt
Jan 13 14:07:13 qesapworker-prg5 worker[73425]: [debug] [pid:73425] Uploading artefact glibc_locale-19.txt
Jan 13 14:07:13 qesapworker-prg5 worker[73425]: [debug] [pid:73425] Uploading artefact glibc_locale-22.txt
Jan 13 14:07:13 qesapworker-prg5 worker[15237]: [debug] [pid:15237] Upload concluded (at glibc_locale)
Jan 13 14:07:13 qesapworker-prg5 worker[15237]: [debug] [pid:15237] REST-API call: POST "http://openqa.suse.de/api/v1/jobs/16427146/set_done?reason=died%3A+terminated+prematurely%2C+see+log+output+for+details&result=incom>
Jan 13 14:07:13 qesapworker-prg5 worker[15237]: [debug] [pid:15237] Job 16427146 from openqa.suse.de finished - reason: died
Jan 13 14:07:13 qesapworker-prg5 worker[15237]: [debug] [pid:15237] Cleaning up for next job
Jan 13 14:08:59 qesapworker-prg5 worker[15237]: [debug] [pid:15237] Accepting job 16427266 from openqa.suse.de.
Jan 13 14:08:59 qesapworker-prg5 worker[15237]: [debug] [pid:15237] Setting job 16427266 from openqa.suse.de up
Jan 13 14:08:59 qesapworker-prg5 worker[15237]: [debug] [pid:15237] Preparing Mojo::IOLoop::ReadWriteProcess::Session
Jan 13 14:08:59 qesapworker-prg5 worker[15237]: [info] [pid:15237] +++ setup notes +++
Jan 13 14:08:59 qesapworker-prg5 worker[15237]: [info] [pid:15237] Running on qesapworker-prg5:23 (Linux 6.4.0-150600.23.30-default #1 SMP PREEMPT_DYNAMIC Sat Dec 7 08:37:53 UTC 2024 (8c25a0a) x86_64)
Jan 13 14:08:59 qesapworker-prg5 worker[15237]: [debug] [pid:15237] Job settings:
…
Relevant os-autoinst logs:
[2025-01-13T14:07:03.162048Z] [debug] [pid:69720] no change: 86.6s
[2025-01-13T14:07:03.510529Z] [debug] [pid:69720] no match: 86.6s, best candidate: grub2-20241114 (0.18)
[2025-01-13T14:07:07.026919Z] [warn] [pid:69720] !!! backend::baseclass::check_asserted_screen: check_asserted_screen took 2.86 seconds for 54 candidate needles - make your needles more specific
[2025-01-13T14:07:07.027016Z] [debug] [pid:69720] no match: 85.6s, best candidate: grub2-20241114 (0.18)
[2025-01-13T14:07:07.028906Z] [debug] [pid:69720] considering VNC stalled, no update for 4.52 seconds
[2025-01-13T14:07:07.038750Z] [debug] [pid:69720] no change: 82.7s
[2025-01-13T14:07:07.378976Z] [debug] [pid:69720] no match: 82.7s, best candidate: grub2-20241114 (0.18)
[2025-01-13T14:07:07.379278Z] [debug] [pid:69720] pointer type 1 0 1024 768 -257
[2025-01-13T14:07:07.379322Z] [debug] [pid:69720] led state 0 1 1 -261
corrupted size vs. prev_size
Error while decoding stream #0:0: Invalid data found when processing input
[2025-01-13T14:07:11.182059Z] [debug] [pid:69511] backend process exited: 0
[2025-01-13T14:07:11.183220Z] [debug] [pid:69511] isotovideo handle_shutdown
[2025-01-13T14:07:11.183480Z] [debug] [pid:69511] stopping command server 69521 because test execution ended
[2025-01-13T14:07:11.183704Z] [debug] [pid:69511] isotovideo: informing websocket clients before stopping command server: http://127.0.0.1:20233/wxFqLKHC9jtwtMlN/broadcast
[2025-01-13T14:07:11.184467Z] [info] [pid:69511] ::: backend::driver::_collect_orphan: Driver backend collected unknown process with pid 69913 and exit status: 0
[2025-01-13T14:07:11.205481Z] [debug] [pid:69511] commands process exited: 0
[2025-01-13T14:07:11.305912Z] [debug] [pid:69511] done with command server
[2025-01-13T14:07:11.306221Z] [debug] [pid:69511] stopping autotest process 69530
[2025-01-13T14:07:11.306715Z] [debug] [pid:69530] autotest received signal TERM, saving results of current test before exiting
[2025-01-13T14:07:11.325243Z] [debug] [pid:69511] [autotest] process exited: 1
[2025-01-13T14:07:11.425890Z] [debug] [pid:69511] done with autotest process
[2025-01-13T14:07:11.426033Z] [debug] [pid:69511] isotovideo failed
[2025-01-13T14:07:11.428938Z] [debug] [pid:69511] stopping backend process 69720
[2025-01-13T14:07:11.429133Z] [debug] [pid:69511] done with backend process
69511: EXIT 1
[2025-01-13T14:07:11.521543Z] [info] [pid:15237] Isotovideo exit status: 1
[2025-01-13T14:07:11.546777Z] [info] [pid:15237] +++ worker notes +++
[2025-01-13T14:07:11.547021Z] [info] [pid:15237] End time: 2025-01-13 14:07:11
[2025-01-13T14:07:11.547142Z] [info] [pid:15237] Result: died
[2025-01-13T14:07:11.554374Z] [info] [pid:73422] Uploading image_info-basic_health_check.txt
[2025-01-13T14:07:11.581017Z] [info] [pid:73422] Uploading image_info-dmesg.txt
[2025-01-13T14:07:11.601762Z] [info] [pid:73422] Uploading image_info-ip-addr-show.txt
[2025-01-13T14:07:11.615916Z] [info] [pid:73422] Uploading image_info-journal.txt
[2025-01-13T14:07:11.681871Z] [info] [pid:73422] Uploading image_info-loadavg.txt
[2025-01-13T14:07:11.702895Z] [info] [pid:73422] Uploading image_info-problem_detection_logs.tar.xz
frame= 1353 fps=1.4 q=0.0 Lsize= 2068kB time=00:00:56.33 bitrate= 300.7kbits/s speed=0.059x
video:2058kB audio:0kB subtitle:0kB other streams:0kB global headers:0kB muxing overhead: 0.462934%
[2025-01-13T14:07:11.726346Z] [info] [pid:73422] Uploading image_info-psaxf.txt
[2025-01-13T14:07:11.743446Z] [info] [pid:73422] Uploading image_info-resolv-conf.txt
[2025-01-13T14:07:11.760713Z] [info] [pid:73422] Uploading image_info-sysconfig.tar.gz
[2025-01-13T14:07:11.780262Z] [info] [pid:73422] Uploading image_info-systemctl.txt
[2025-01-13T14:07:11.805747Z] [info] [pid:73422] Uploading image_info-systemctl_status.txt
[2025-01-13T14:07:11.830682Z] [info] [pid:73422] Uploading image_info-systemctl_unit-files.txt
[2025-01-13T14:07:11.850402Z] [info] [pid:73422] Uploading journal_check-audit.tar
[2025-01-13T14:07:11.870232Z] [info] [pid:73422] Uploading journal_check-full_journal.log
[2025-01-13T14:07:11.933653Z] [info] [pid:73422] Uploading pam-results.tap
[2025-01-13T14:07:11.961788Z] [info] [pid:73422] Uploading video.webm
[2025-01-13T14:07:12.037806Z] [info] [pid:73422] Uploading autoinst-log.txt
Updated by mkittler about 2 months ago
This needs to be fixed in RWP (unless we want to inherit from its package and reimplement is_running and stop on our side): https://github.com/openSUSE/Mojo-IOLoop-ReadWriteProcess/pull/61
I also came up with a small improvement for openQA: https://github.com/os-autoinst/openQA/pull/6110
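If we did go the subclassing route mentioned above instead, a rough sketch could look like the following (purely illustrative: the package name is hypothetical, only RWP's documented pid/is_running/stop are assumed, and RWP's internal bookkeeping such as waitpid/session handling is ignored, which is exactly why fixing RWP itself is preferable):
package OpenQA::Worker::IsotovideoProcess;    # hypothetical package name
use Mojo::Base 'Mojo::IOLoop::ReadWriteProcess';

# Consider the process "running" if anything in its process group is still alive.
sub is_running {
    my ($self) = @_;
    my $pid = $self->pid or return 0;
    return kill(0, -$pid) ? 1 : 0;
}

# Terminate the whole process group, escalating to SIGKILL if needed.
sub stop {
    my ($self) = @_;
    my $pid = $self->pid or return $self;
    kill 'TERM', -$pid;
    sleep 1;    # crude grace period, for illustration only
    kill 'KILL', -$pid if kill 0, -$pid;
    return $self;
}

1;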
Updated by mkittler about 2 months ago · Edited
- Status changed from In Progress to Feedback
Both PRs have been merged.
I also created https://build.opensuse.org/package/show/devel:openQA/perl-Mojo-IOLoop-ReadWriteProcess and deployed it on OSD workers via sudo salt -C 'G@roles:worker' cmd.run 'zypper in --allow-vendor-change perl-Mojo-IOLoop-ReadWriteProcess'.
Judging by the output of sudo salt -C 'G@roles:worker' cmd.run "grep -- '-\$pid' /usr/lib/perl5/vendor_perl/5.26.1/Mojo/IOLoop/ReadWriteProcess.pm", the new code is in place.
Note that the package hasn't been published for aarch64 yet, so if we still see the issue on that architecture it wouldn't necessarily mean that the fix didn't work.
So let's see whether this fixes the issue.
Updated by mkittler about 1 month ago
We had to revert changes due to a race condition.
This change will fix it: https://github.com/openSUSE/Mojo-IOLoop-ReadWriteProcess/pull/62
I'll continue deploying it later but only on a single worker.
I'll also try to improve tests in Mojo-RWP.
Updated by okurz about 1 month ago
- Related to action #175464: jobs incomplete with auto_review:"setup failure: isotovideo can not be started" added
Updated by mkittler about 1 month ago · Edited
This is also a bit problematic when it comes to stopping the process: if we call stop before setpgrp(0, 0) runs (or setpgrp(0, 0) does not run at all for some reason), the process cannot be stopped. We always had this problem (also before my recent changes), so supposedly it doesn't really matter in practice. However, here's a draft for how to fix this: https://github.com/openSUSE/Mojo-IOLoop-ReadWriteProcess/pull/64
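To make the race more concrete, here is a tiny self-contained Perl illustration (not RWP code; the mitigation shown is a common pattern and not necessarily what the linked PR does): if the parent signals the group before the child has executed setpgrp(0, 0), no process group with that ID exists yet and the signal is lost.
use strict;
use warnings;
use POSIX ();

my $pid = fork() // die "fork failed: $!";
if ($pid == 0) {
    # In the real code this is where the child detaches into its own group;
    # anything the parent does before this line races against it.
    setpgrp(0, 0);
    exec 'sleep', '60' or die "exec failed: $!";
}

# Racy: if the child has not reached setpgrp() yet, process group $pid
# does not exist and the TERM is lost (kill fails with "No such process").
unless (kill 'TERM', -$pid) {
    warn "could not signal group $pid: $!\n";
    # Common mitigation (not necessarily what the linked PR does): the parent
    # sets the child's group as well, so whichever side runs first wins.
    POSIX::setpgid($pid, $pid);
    kill 'TERM', -$pid;
}
waitpid $pid, 0;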
Updated by mkittler about 1 month ago
Here is a version of RWP I will deploy tomorrow on a single production test host to see whether my latest changes work in production: https://build.opensuse.org/package/show/home:mkittler:branches:devel:languages:perl/perl-Mojo-IOLoop-ReadWriteProcess
Updated by mkittler about 1 month ago
As mentioned on Slack:
I now deployed a fixed version of the change on openqaworker14.qa.suse.cz because that was one of the previously affected machines. If you see any issues, simply disable the repository home_mkittler_test_poo170209 I added there and install the regular version of RWP again via zypper in --allow-vendor-change perl-Mojo-IOLoop-ReadWriteProcess. (Of course I'll handle that, just in case I'm not around.)
Updated by mkittler about 1 month ago
So far the situation looks good on openqaworker14:
openqa=> select id, t_finished, result, reason from jobs where (select host from workers where id = assigned_worker_id) = 'openqaworker14' and result = 'incomplete' and t_finished >= '2025-01-16T12:00:00' order by t_finished;
id | t_finished | result | reason
----------+---------------------+------------+-------------------------------------------------------------
16479164 | 2025-01-16 14:01:27 | incomplete | backend died: QEMU exited unexpectedly, see log for details
16479166 | 2025-01-16 14:02:29 | incomplete | backend died: QEMU exited unexpectedly, see log for details
16479171 | 2025-01-16 14:03:26 | incomplete | backend died: QEMU exited unexpectedly, see log for details
16479176 | 2025-01-16 14:04:46 | incomplete | backend died: QEMU exited unexpectedly, see log for details
16479177 | 2025-01-16 14:05:50 | incomplete | backend died: QEMU exited unexpectedly, see log for details
16479178 | 2025-01-16 14:06:53 | incomplete | backend died: QEMU exited unexpectedly, see log for details
16479179 | 2025-01-16 14:07:57 | incomplete | backend died: QEMU exited unexpectedly, see log for details
(7 rows)
These incompletes were caused by a leftover process on slot 14 which I have cleaned up only now. The leftover was from https://openqa.suse.de/tests/16472470, which ran 12 hours ago, before I deployed my changes.
That means there have so far been no relevant leftover processes and also no regressions. (I also checked failed jobs but couldn't find any relevant ones.)
Updated by rfan1 about 1 month ago
I hit the issue again: https://openqa.suse.de/tests/16486907
Can you please help double-check?
Updated by mkittler about 1 month ago
This slot looks fine. I assume you have already killed the leftover process yourself then.
I ran the query from yesterday again to check on openqaworker14 and it looks good. So I'm going to deploy my change everywhere again.
Updated by mkittler about 1 month ago
Considering there are other problems with OSD right now, we decided to deploy the fix everywhere only next week.
Until then, here's how you can work around it:
Check whether there's a leftover process:
martchus@qesapworker-prg6:~> sudo systemctl status openqa-worker-auto-restart@8
● openqa-worker-auto-restart@8.service - openQA Worker #8
Loaded: loaded (/usr/lib/systemd/system/openqa-worker-auto-restart@.service; enabled; preset: disabled)
Drop-In: /etc/systemd/system/openqa-worker-auto-restart@.service.d
└─20-nvme-autoformat.conf, 30-openqa-max-inactive-caching-downloads.conf
Active: active (running) since Sun 2025-01-12 05:18:28 UTC; 1 day 9h ago
Process: 21274 ExecStartPre=/usr/bin/install -d -m 0755 -o _openqa-worker /var/lib/openqa/pool/8 (code=exited, status=0/SUCCESS)
Main PID: 21280 (worker)
Tasks: 9 (limit: 19660)
CPU: 1h 45min 11.987s
CGroup: /openqa.slice/openqa-worker.slice/openqa-worker-auto-restart@8.service
├─21280 /usr/bin/perl /usr/share/openqa/script/worker --instance 8
├─23645 /usr/bin/qemu-system-x86_64 -device VGA,edid=on,xres=1024,yres=768 -only-migratable -chardev ringbuf,id=serial0,logfile=serial0,logappend=on -serial chardev:serial0 -audiodev none,id=snd0 -device inte>
└─75353 swtpm socket --tpmstate dir=/tmp/mytpm8 --ctrl type=unixio,path=/tmp/mytpm8/swtpm-sock --log level=20 -d
Jan 13 15:16:50 qesapworker-prg6 worker[21280]: eval {...} called at /usr/lib/perl5/vendor_perl/5.26.1/Mojo/Reactor/Poll.pm line 141
Jan 13 15:16:50 qesapworker-prg6 worker[21280]: Mojo::Reactor::Poll::_try(Mojo::Reactor::Poll=HASH(0x555de2562340), "I/O watcher", CODE(0x555de6399db8), 0) called at /usr/lib/perl5/vendor_perl/5.26.1/Mojo/Reactor/>
Jan 13 15:16:50 qesapworker-prg6 worker[21280]: Mojo::Reactor::Poll::one_tick(Mojo::Reactor::Poll=HASH(0x555de2562340)) called at /usr/lib/perl5/vendor_perl/5.26.1/Mojo/Reactor/Poll.pm line 101
Jan 13 15:16:50 qesapworker-prg6 worker[21280]: Mojo::Reactor::Poll::start(Mojo::Reactor::Poll=HASH(0x555de2562340)) called at /usr/lib/perl5/vendor_perl/5.26.1/Mojo/IOLoop.pm line 134
Jan 13 15:16:50 qesapworker-prg6 worker[21280]: Mojo::IOLoop::start(Mojo::IOLoop=HASH(0x555de3e99fe8)) called at /usr/share/openqa/script/../lib/OpenQA/Worker.pm line 369
Jan 13 15:16:50 qesapworker-prg6 worker[21280]: OpenQA::Worker::exec(OpenQA::Worker=HASH(0x555de5a1a798)) called at /usr/share/openqa/script/worker line 120
Jan 13 15:16:50 qesapworker-prg6 worker[21280]: [debug] [pid:21280] Upload concluded (no current module)
Jan 13 15:16:50 qesapworker-prg6 worker[21280]: [debug] [pid:21280] REST-API call: POST "http://openqa.suse.de/api/v1/jobs/16436185/set_done?reason=backend+died%3A+QEMU+exited+unexpectedly%2C+see+log+for+details&result=in>
Jan 13 15:16:50 qesapworker-prg6 worker[21280]: [debug] [pid:21280] Job 16436185 from openqa.suse.de finished - reason: died
Jan 13 15:16:50 qesapworker-prg6 worker[21280]: [debug] [pid:21280] Cleaning up for next job
Terminate it:
martchus@qesapworker-prg6:~> sudo kill 23645 75353
Run kill again to see if it worked:
martchus@qesapworker-prg6:~> sudo kill 23645 75353
kill: sending signal to 23645 failed: No such process
kill: sending signal to 75353 failed: No such process
Updated by mkittler about 1 month ago
Everything still looks good on openqaworker14, so I'm going to bring in my changes again on all workers.
I ran osc copypac home:mkittler:branches:devel:languages:perl perl-Mojo-IOLoop-ReadWriteProcess devel:openQA to copy the link from my home project to devel:openQA and will update all OSD workers as soon as the build has finished.
Updated by okurz about 1 month ago
- Due date changed from 2025-01-21 to 2025-01-24
Given that we decided to explicitly delay the deployment, we should allow a few more days for monitoring. Updating the due date accordingly.
Updated by mkittler about 1 month ago · Edited
Considering https://openqa.suse.de/tests/16517840, this might not fix all cases, or it only fixes one part of the problem:
- The first occurrence of the issue on that slot was https://openqa.suse.de/tests/16516784, and before it ran https://openqa.suse.de/tests/16516249, which probably caused the leftover process. So all jobs ran when the fix was already in place.
- Looking at the worker slot right now I cannot see a leftover process, though.
- So maybe there was another reason why the port was blocked (and not a leftover process)? That's not very likely.
- Or the worker now sends SIGTERM/SIGKILL correctly (hence no leftover process stayed behind as it used to), but the process was still around for a while (maybe until an uninterruptible system call finally timed out).
- The last job that ran on the slot was also still affected (https://openqa.suse.de/tests/16518078). So the leftover process must have nevertheless been around for over an hour.
- EDIT: New jobs have been scheduled on the slot since then and passed as expected (since no leftover process was around anymore).
Involved jobs:
openqa=> select id, state, result, t_created, t_started, t_finished, assigned_worker_id from jobs where id in ( 16517840, 16516784, 16518078 ) order by t_started;
id | state | result | t_created | t_started | t_finished | assigned_worker_id
----------+-------+------------+---------------------+---------------------+---------------------+--------------------
16516249 | done | incomplete | 2025-01-20 21:47:47 | 2025-01-21 00:51:11 | 2025-01-21 01:10:54 | 2597 <- worker31:31
16516784 | done | incomplete | 2025-01-21 00:19:31 | 2025-01-21 01:12:02 | 2025-01-21 01:16:31 | 2597
16517840 | done | incomplete | 2025-01-21 02:13:19 | 2025-01-21 02:13:25 | 2025-01-21 02:13:59 | 2597
16518078 | done | incomplete | 2025-01-21 03:27:10 | 2025-01-21 03:27:16 | 2025-01-21 03:28:13 | 2597
Updated by mkittler about 1 month ago
It looks like two worker slots were affected by the problem after the deployment of the fix:
openqa=> select count(id), min(state) as state, min(result) as result, min(reason) as reason, min(t_finished) as t_finished_min, max(t_finished) as t_finished_max, min((select concat(host, ':', instance) from workers where workers.id = jobs.assigned_worker_id)) as worker_slot from jobs where t_started >= '2025-01-20 14:00' and (select count(comments.id) from comments where comments.job_id = jobs.id and text ilike '%poo#170209%') > 0 group by jobs.assigned_worker_id order by count(jobs.id) desc limit 100;
count | state | result | reason | t_finished_min | t_finished_max | worker_slot
-------+-------+------------+-------------------------------------------------------------+---------------------+---------------------+---------------------
328 | done | incomplete | backend died: QEMU exited unexpectedly, see log for details | 2025-01-20 16:28:43 | 2025-01-21 04:33:36 | qesapworker-prg7:19
51 | done | incomplete | backend died: QEMU exited unexpectedly, see log for details | 2025-01-21 01:16:31 | 2025-01-21 03:28:13 | worker31:31
When my change was not effective the problem was also only happening on two slots:
openqa=> select count(id), min(state) as state, min(result) as result, min(reason) as reason, min(t_finished) as t_finished_min, max(t_finished) as t_finished_max, min((select concat(host, ':', instance) from workers where workers.id = jobs.assigned_worker_id)) as worker_slot from jobs where t_started >= '2025-01-18 00:00' and t_started < '2025-01-20' and (select count(comments.id) from comments where comments.job_id = jobs.id and text ilike '%poo#170209%') > 0 group by jobs.assigned_worker_id order by count(jobs.id) desc limit 100;
count | state | result | reason | t_finished_min | t_finished_max | worker_slot
-------+-------+------------+-------------------------------------------------------------+---------------------+---------------------+---------------------
57 | done | incomplete | backend died: QEMU exited unexpectedly, see log for details | 2025-01-19 20:16:33 | 2025-01-19 23:19:17 | qesapworker-prg7:19
35 | done | incomplete | backend died: QEMU exited unexpectedly, see log for details | 2025-01-18 00:10:37 | 2025-01-19 00:40:39 | qesapworker-prg6:20
Slot worker31:31 is the one mentioned in the previous note. Slot qesapworker-prg7:19 still had a leftover process from before the deployment of my changes, so it can be ignored.
This leaves finding out what happened on worker31:31.
Updated by mkittler about 1 month ago · Edited
I ran the first query from the previous note again. The results are identical, so there are no new occurrences. The change probably helps, although I still don't know what happened on worker31:31.
I verified the install date of RWP on worker31 and it was definitely installed before the problematic test was started (2025-01-21 00:51:11):
martchus@worker31:~> rpm -qi perl-Mojo-IOLoop-ReadWriteProcess
Name : perl-Mojo-IOLoop-ReadWriteProcess
Version : 0.340.0
Release : lp156.43.1
Architecture: noarch
Install Date: Mon 20 Jan 2025 11:00:39 AM UTC
/usr/lib/perl5/vendor_perl/5.26.1/Mojo/IOLoop/ReadWriteProcess.pm also contains the new/fixed code.
The job that probably caused the leftover process terminated like this by the way:
[2025-01-21T01:10:38.743738Z] [warn] [pid:59698] !!! backend::baseclass::check_asserted_screen: check_asserted_screen took 2.49 seconds for 55 candidate needles - make your needles more specific
[2025-01-21T01:10:38.743844Z] [debug] [pid:59698] no match: 0.8s, best candidate: bootmenu-SLES-generic-15SP6-20230809 (0.29)
[2025-01-21T01:10:38.745097Z] [debug] [pid:59698] considering VNC stalled, no update for 4.24 seconds
[2025-01-21T01:10:41.184238Z] [warn] [pid:59698] !!! backend::baseclass::check_asserted_screen: check_asserted_screen took 2.44 seconds for 55 candidate needles - make your needles more specific
[2025-01-21T01:10:41.184331Z] [debug] [pid:59698] no match: -1.7s, best candidate: bootmenu-SLES-generic-15SP6-20230809 (0.29)
[2025-01-21T01:10:44.597147Z] [debug] [pid:59480] backend process exited: 0
[2025-01-21T01:10:44.597482Z] [debug] [pid:59480] isotovideo handle_shutdown
[2025-01-21T01:10:44.597572Z] [debug] [pid:59480] stopping command server 59493 because test execution ended
[2025-01-21T01:10:44.597641Z] [debug] [pid:59480] isotovideo: informing websocket clients before stopping command server: http://127.0.0.1:20313/GMMM4XezHMH0KycZ/broadcast
[2025-01-21T01:10:44.599935Z] [info] [pid:59480] ::: backend::driver::_collect_orphan: Driver backend collected unknown process with pid 59888 and exit status: 0
[2025-01-21T01:10:44.613112Z] [debug] [pid:59480] commands process exited: 0
[2025-01-21T01:10:44.713381Z] [debug] [pid:59480] done with command server
[2025-01-21T01:10:44.713487Z] [debug] [pid:59480] stopping autotest process 59502
[2025-01-21T01:10:44.713673Z] [debug] [pid:59502] autotest received signal TERM, saving results of current test before exiting
[2025-01-21T01:10:44.731586Z] [debug] [pid:59480] [autotest] process exited: 1
[2025-01-21T01:10:44.831837Z] [debug] [pid:59480] done with autotest process
[2025-01-21T01:10:44.831910Z] [debug] [pid:59480] isotovideo failed
[2025-01-21T01:10:44.832682Z] [debug] [pid:59480] stopping backend process 59698
[2025-01-21T01:10:44.832754Z] [debug] [pid:59480] done with backend process
59480: EXIT 1
[2025-01-21T01:10:44.897741Z] [info] [pid:11182] Isotovideo exit status: 1
So probably some backend process terminated unexpectedly. This is of course a problem of its own and I'm not sure what to do about it based on the information I could gather.
Updated by mkittler about 1 month ago
The query still gives the same results. I will resolve this issue tomorrow if there are no further occurrences until then. (Four days without new occurrences should be good enough.)
I would not wait to resolve this issue until the new RWP version is released, because it would make sense to improve the RWP unit tests before doing that - especially as we will create a ticket about it after the 5 Whys analysis.
Updated by mkittler about 1 month ago
- Status changed from Feedback to Resolved
Still no further occurrences so I'm resolving this ticket as mentioned in the previous comment.