Actions

action #170209

closed

coordination #102906: [saga][epic] Increased stability of tests with less "known failures", known incompletes handled automatically within openQA

coordination #175515: [epic] incomplete jobs with "Failed to find an available port: Address already in use"

[sporadic] auto_review:"Failed to find an available port: Address already in use":retry, produces incomplete jobs on OSD, multiple machines size:M

Added by nicksinger 6 months ago. Updated about 1 month ago.

Status:

Resolved

Priority:

High

Assignee:

mkittler

Category:

Regressions/Crashes

Target version:

Ready

Start date:

2024-11-25

Due date:

% Done:

Estimated time:

Tags:

osd, reactive work

Description

Observation¶

I received an alert mail about an incomplete job: https://openqa.suse.de/tests/15996418
It fails with:

[2024-11-25T00:19:57.525877Z] [warn] [pid:103270] !!! : qemu-system-x86_64: -vnc :102,share=force-shared: Failed to find an availabale port: Address already in use

I asked in Slack and @tinita observed the same since "a few hours" - apparently all on worker39

https://openqa.suse.de/admin/workers/2898 shows that this started about 12 hours ago with this job: https://openqa.suse.de/tests/15995445
The qemu-system-x86_64 process 1963 is running since 12 hours with -vnc :102,share=force-shared.

Acceptance criteria¶

AC1: Affected jobs are restarted automatically
AC2: We have a better understanding of situations where this can happen (if at all)

Suggestion¶

Check one more time for bugs – also consider testing (!) – in the code for handling leftover QEMU processes
Check one more time for bugs – also consider testing (!) – in terminating/killing the process group of isotovideo (in Mojo::…::ReadWriteProcess)
Add/enable debug logging when starting/stopping isotovideo (maybe on ReadWriteProcess level)
Consider starting/stopping isotovideo in a process group with low-level Perl code to replicate the error and investigate and potentially replace the problematic Mojo::ReadWriteProcess?

Related issues 3 (1 open — 2 closed)

Actions

Copy link

Updated by tinita 6 months ago

Description updated (diff)

Actions

Copy link

Updated by okurz 6 months ago

Tags set to reactive work, osd
Category set to Regressions/Crashes
Priority changed from Normal to Urgent
Target version set to Ready

Actions

Copy link

Updated by tinita 6 months ago

Priority changed from Urgent to High

% ps aux | grep qemu
...
_openqa+   1963 98.4  0.4 7061392 2206548 ?     Sl   Nov24 750:16 /usr/bin/qemu-system-x86_64 -device VGA,edid=on,xres=1024,yres=768 -chardev ringbuf,id=serial0,logfile=serial0,logappend=on -serial chardev:serial0 -audiodev none,id=snd0 -device intel-hda -device hda-output,audiodev=snd0 -global isa-fdc.fdtypeA=none -m 2048 -cpu qemu64 -netdev user,id=qanet0,hostfwd=tcp::39102-:39102 -device virtio-net,netdev=qanet0,mac=52:54:00:12:34:56 -object rng-random,filename=/dev/urandom,id=rng0 -device virtio-rng-pci,rng=rng0 -device qemu-xhci -device usb-tablet -smp 1 -enable-kvm -no-shutdown -vnc :102,share=force-shared -device virtio-serial -chardev pipe,id=virtio_console,path=virtio_console,logfile=virtio_console.log,logappend=on -device virtconsole,chardev=virtio_console,name=org.openqa.console.virtio_console -chardev pipe,id=virtio_console_user,path=virtio_console_user,logfile=virtio_console_user.log,logappend=on -device virtconsole,chardev=virtio_console_user,name=org.openqa.console.virtio_console_user -chardev socket,path=qmp_socket,server=on,wait=off,id=qmp_socket,logfile=qmp_socket.log,logappend=on -qmp chardev:qmp_socket -S -device virtio-scsi-pci,id=scsi0 -blockdev driver=file,node-name=hd0-file,filename=/var/lib/openqa/pool/12/raid/hd0,cache.no-flush=on -blockdev driver=qcow2,node-name=hd0,file=hd0-file,cache.no-flush=on,discard=unmap -device virtio-blk,id=hd0-device,drive=hd0,serial=hd0 -blockdev driver=file,node-name=cd0-overlay0-file,filename=/var/lib/openqa/pool/12/raid/cd0-overlay0,cache.no-flush=on -blockdev driver=qcow2,node-name=cd0-overlay0,file=cd0-overlay0-file,cache.no-flush=on,discard=unmap -device scsi-cd,id=cd0-device,drive=cd0-overlay0,serial=cd0 -drive id=pflash-code-overlay0,if=pflash,file=/var/lib/openqa/pool/12/raid/pflash-code-overlay0,unit=0,readonly=on -drive id=pflash-vars-overlay0,if=pflash,file=/var/lib/openqa/pool/12/raid/pflash-vars-overlay0,unit=1
% date
Mon Nov 25 11:10:00 AM UTC 2024

The process was started 22:28, so it must be from this test:
https://openqa.suse.de/tests/15995445/logfile?filename=autoinst-log.txt#line-201

I stopped the worker instance for now:

systemctl stop openqa-worker-auto-restart@12.service

So I think the urgency can be reduced.

Actions

Copy link

Updated by tinita 6 months ago

Stopping the instance apparently killed the process, so I started the instance again.

Actions

Copy link

Updated by jbaier_cz 6 months ago

Related to action #166025: qemu process hanging indefinitely blocking the corresponding VNC port size:S added

Actions

Copy link

Updated by tinita 6 months ago

I restarted all affected jobs

Actions

Copy link

Updated by mkittler 6 months ago

Subject changed from "Failed to find an available port: Address already in use", produces incomplete jobs on OSD - most likely related to worker39 to "Failed to find an available port: Address already in use", produces incomplete jobs on OSD - most likely related to worker39 size:M
Description updated (diff)
Status changed from New to Workable

Actions

Copy link

Updated by okurz 5 months ago

Priority changed from High to Normal

Actions

Copy link

Updated by nicksinger 5 months ago

There was another occurrence of this today on worker39 reported by Martin via slack. The test failed similar to the previous report:

[2024-12-10T00:49:11.761843Z] [debug] [pid:121552] starting: /usr/bin/qemu-system-x86_64 -device VGA,edid=on,xres=1024,yres=768 -only-migratable -chardev ringbuf,id=serial0,logfile=serial0,logappend=on -serial chardev:serial0 -audiodev none,id=snd0 -device intel-hda -device hda-output,audiodev=snd0 -global isa-fdc.fdtypeA=none -m 1024 -cpu qemu64 -netdev user,id=qanet0 -device virtio-net,netdev=qanet0,mac=52:54:00:12:34:56 -object rng-random,filename=/dev/urandom,id=rng0 -device virtio-rng-pci,rng=rng0 -boot order=c -device qemu-xhci -device usb-tablet -smp 16,threads=2 -enable-kvm -no-shutdown -vnc :103,share=force-shared -device virtio-serial -chardev pipe,id=virtio_console,path=virtio_console,logfile=virtio_console.log,logappend=on -device virtconsole,chardev=virtio_console,name=org.openqa.console.virtio_console -chardev pipe,id=virtio_console_user,path=virtio_console_user,logfile=virtio_console_user.log,logappend=on -device virtconsole,chardev=virtio_console_user,name=org.openqa.console.virtio_console_user -chardev socket,path=qmp_socket,server=on,wait=off,id=qmp_socket,logfile=qmp_socket.log,logappend=on -qmp chardev:qmp_socket -S -device virtio-scsi-pci,id=scsi0 -blockdev driver=file,node-name=hd0-overlay0-file,filename=/var/lib/openqa/pool/13/raid/hd0-overlay0,cache.no-flush=on -blockdev driver=qcow2,node-name=hd0-overlay0,file=hd0-overlay0-file,cache.no-flush=on,discard=unmap -device virtio-blk,id=hd0-device,drive=hd0-overlay0,bootindex=0,serial=hd0
[2024-12-10T00:49:11.768614Z] [debug] [pid:121552] Waiting for 0 attempts
dmesg: read kernel buffer failed: Operation not permitted
[2024-12-10T00:49:12.073274Z] [debug] [pid:121552] Waiting for 1 attempts
[2024-12-10T00:49:12.073470Z] [info] [pid:121552] ::: backend::baseclass::die_handler: Backend process died, backend errors are reported below in the following lines:
  QEMU terminated before QMP connection could be established. Check for errors below
[2024-12-10T00:49:12.073805Z] [info] [pid:121552] ::: OpenQA::Qemu::Proc::save_state: Saving QEMU state to qemu_state.json
[2024-12-10T00:49:12.074785Z] [debug] [pid:121552] Passing remaining frames to the video encoder
[2024-12-10T00:49:12.078140Z] [debug] [pid:121552] Waiting for video encoder to finalize the video
[2024-12-10T00:49:12.078216Z] [debug] [pid:121552] The built-in video encoder (pid 121742) terminated
[2024-12-10T00:49:12.079343Z] [debug] [pid:121552] QEMU: QEMU emulator version 8.2.7 (SUSE Linux Enterprise 15)
[2024-12-10T00:49:12.079413Z] [debug] [pid:121552] QEMU: Copyright (c) 2003-2023 Fabrice Bellard and the QEMU Project developers
[2024-12-10T00:49:12.079491Z] [warn] [pid:121552] !!! : qemu-system-x86_64: warning: This family of AMD CPU doesn't support hyperthreading(2)
[2024-12-10T00:49:12.079543Z] [debug] [pid:121552] QEMU: Please configure -smp options properly or try enabling topoext feature.
[2024-12-10T00:49:12.079582Z] [warn] [pid:121552] !!! : qemu-system-x86_64: -vnc :103,share=force-shared: Failed to find an available port: Address already in use
[2024-12-10T00:49:12.080349Z] [debug] [pid:121552] sending magic and exit
[2024-12-10T00:49:12.080650Z] [debug] [pid:120871] received magic close
[2024-12-10T00:49:12.089917Z] [debug] [pid:120871] backend process exited: 0
[2024-12-10T00:49:12.190817Z] [warn] [pid:120871] !!! main: failed to start VM at /usr/lib/os-autoinst/backend/driver.pm line 104.
  	backend::driver::start_vm(backend::driver=HASH(0x55b9f6c2f468)) called at /usr/lib/os-autoinst/OpenQA/Isotovideo/Backend.pm line 18
  	OpenQA::Isotovideo::Backend::new("OpenQA::Isotovideo::Backend") called at /usr/lib/os-autoinst/OpenQA/Isotovideo/Runner.pm line 109
  	OpenQA::Isotovideo::Runner::create_backend(OpenQA::Isotovideo::Runner=HASH(0x55b9edbfa680)) called at /usr/lib/os-autoinst/OpenQA/Isotovideo/Runner.pm line 251
  	OpenQA::Isotovideo::Runner::init(OpenQA::Isotovideo::Runner=HASH(0x55b9edbfa680)) called at /usr/bin/isotovideo line 183
  	eval {...} called at /usr/bin/isotovideo line 178
  
[2024-12-10T00:49:12.191180Z] [debug] [pid:120871] stopping command server 121330 because test execution ended through exception
[2024-12-10T00:49:12.391590Z] [debug] [pid:120871] done with command server
[2024-12-10T00:49:12.391689Z] [debug] [pid:120871] stopping autotest process 121362
[2024-12-10T00:49:12.592074Z] [debug] [pid:120871] done with autotest process
120871: EXIT 1
[2024-12-10T00:49:12.660935Z] [info] [pid:4058] Isotovideo exit status: 1
[2024-12-10T00:49:12.693330Z] [info] [pid:4058] +++ worker notes +++
[2024-12-10T00:49:12.693534Z] [info] [pid:4058] End time: 2024-12-10 00:49:12
[2024-12-10T00:49:12.693648Z] [info] [pid:4058] Result: died
[2024-12-10T00:49:12.701384Z] [info] [pid:121845] Uploading autoinst-log.txt

There are some more details in the slack thread but I didn't follow closely and don't completely understand what the "cause" of this was.

Actions

Copy link

#10

Updated by gpathak 5 months ago · Edited

As @tinita suggested to find the job that actually started the leftover qemu process.
I found that on worker39:13, the job https://openqa.suse.de/tests/16150344/ was cancelled and after looking at the logs https://openqa.suse.de/tests/16150344/logfile?filename=autoinst-log.txt#line-8085 found that QEMU didn't get signal 15 from isotovideo.
So, this was the first job on worker39:13 which didn't close the QEMU session gracefully and caused "Address Already in Use" error on further retries.

Marius mentioned that we already have an open ticket for the scenario when QEMU sometimes doesn't exit gracefully.
Maybe this is also related: #136238

Actions

Copy link

#11

Updated by tinita 5 months ago

I found that both processes causing the hanging qemu process have this in common, with a tiny difference:
https://openqa.suse.de/tests/16150344/logfile?filename=autoinst-log.txt#line-8091

Fatal glibc error: malloc.c:4189 (_int_malloc): assertion failed: chunk_main_arena (fwd)

https://openqa.suse.de/tests/15995445/logfile?filename=autoinst-log.txt#line-5639

Fatal glibc error: malloc.c:4172 (_int_malloc): assertion failed: chunk_main_arena (bck->bk)

In both cases we have ::: backend::driver::_collect_orphan: Driver backend collected unknown process after that, but that can't be the qemu process, because we know it was still running, causing the actual problem.

Actions

Copy link

#12

Updated by nicksinger 5 months ago

this time we have something on openqaworker18. The last good job was:

https://openqa.suse.de/tests/16218776
https://openqa.suse.de/tests/16219186 <- first incomplete but without a "Failed to find an available port"-message
https://openqa.suse.de/tests/16219574 <- first incomplete with this message

what I noticed while trying to understand why 16219186 in-completed:

[2024-12-17T19:00:35.155667Z] [info] [pid:38979] Preparing cgroup to start isotovideo
[2024-12-17T19:00:35.156204Z] [warn] [pid:38979] Disabling cgroup usage because cgroup creation failed: mkdir /sys/fs/cgroup/systemd: Permission denied at /usr/lib/perl5/vendor_perl/5.26.1/Mojo/File.pm line 85.
[2024-12-17T19:00:35.156268Z] [info] [pid:38979] You can define a custom slice with OPENQA_CGROUP_SLICE or indicating the base mount with MOJO_CGROUP_FS.

but not sure if related or not.

Actions

Copy link

#13

Updated by nicksinger 5 months ago

Another broken instance on worker18: https://openqa.suse.de/tests/16236652 - I might reboot worker18 if nobody wants to look into it right now (asked in slack)

Actions

Copy link

#14

Updated by okurz 5 months ago

Subject changed from "Failed to find an available port: Address already in use", produces incomplete jobs on OSD - most likely related to worker39 size:M to [sporadic] auto_review:"Failed to find an available port: Address already in use":retry, produces incomplete jobs on OSD, multiple machines size:M
Priority changed from Normal to High

Also happened in https://gitlab.suse.de/openqa/scripts-ci/-/jobs/3573426#L52 and there is no automatic retrying yet so we should handle this with higher priority. I also added an "auto_review" expression.

Actions

Copy link

#15

Updated by gpuliti 5 months ago

Description updated (diff)
Assignee set to gpuliti

Actions

Copy link

#16

Updated by gpuliti 5 months ago

Status changed from Workable to In Progress

Actions

Copy link

#17

Updated by gpuliti 5 months ago

Status changed from In Progress to Workable
Assignee deleted (~~gpuliti~~)

Actions

Copy link

#18

Updated by livdywan 4 months ago

Status changed from Workable to In Progress
Assignee set to livdywan

I'm investigating this atm

Actions

Copy link

#19

Updated by openqa_review 4 months ago

Due date set to 2025-01-21

Setting due date based on mean cycle time of SUSE QE Tools

Actions

Copy link

#20

Updated by tinita 4 months ago · Edited

To search for recent failures like this you can use:

select concat('https://openqa.suse.de/t', id), id, assigned_worker_id, t_started, state, result, test, reason from jobs where reason = 'backend died: QEMU exited unexpectedly, see log for details' order by t_started desc limit 20;

It can theoretically be a different error, so the log still needs to be checked.

And we should kill the qemu process on that worker, so that the next job on this worker instance won't run into the same issue.

Actions

Copy link

#21

Updated by livdywan 4 months ago

Status changed from In Progress to Workable
Assignee deleted (~~livdywan~~)

I'm afraid there is something more important on my mind.

Actions

Copy link

#22

Updated by mkittler 4 months ago

Status changed from Workable to In Progress
Assignee set to mkittler

Actions

Copy link

#23

Updated by mkittler 4 months ago · Edited

We saw this again on multiple slots. Example: https://openqa.suse.de/admin/workers/3377

Notes from my investigation:

Normally the worker slot should at least show up as broken but here it doesn't (it is idling).

systemctl shows the leftover process quite clearly:

openqa-worker-auto-restart@23.service - openQA Worker #23
     Loaded: loaded (/usr/lib/systemd/system/openqa-worker-auto-restart@.service; enabled; preset: disabled)
    Drop-In: /etc/systemd/system/openqa-worker-auto-restart@.service.d
             └─20-nvme-autoformat.conf, 30-openqa-max-inactive-caching-downloads.conf
     Active: active (running) since Sun 2025-01-12 05:18:31 UTC; 1 day 9h ago
    Process: 15228 ExecStartPre=/usr/bin/install -d -m 0755 -o _openqa-worker /var/lib/openqa/pool/23 (code=exited, status=0/SUCCESS)
   Main PID: 15237 (worker)
      Tasks: 7 (limit: 19660)
        CPU: 7h 31min 49.917s
     CGroup: /openqa.slice/openqa-worker.slice/openqa-worker-auto-restart@23.service
             ├─15237 /usr/bin/perl /usr/share/openqa/script/worker --instance 23
             └─69998 /usr/bin/qemu-system-x86_64 -device virtio-vga,edid=on,xres=1024,yres=768 -only-migratable -chardev ringbuf,id=serial0,logfile=serial0,logappend=on -serial chardev:serial0 -audiodev none,id=snd0 -devi>

The leftover process can just be terminated, e.g. sudo kill 69998.

Link to job that caused the leftover process: https://openqa.suse.de/tests/16427146

Relevant worker logs:

Jan 13 14:07:12 qesapworker-prg5 worker[73422]: [info] [pid:73422] Uploading serial_terminal.txt
Jan 13 14:07:12 qesapworker-prg5 worker[73422]: [debug] [pid:73422] Uploading artefact serial_terminal.txt
Jan 13 14:07:12 qesapworker-prg5 worker[73422]: [info] [pid:73422] Uploading serial_terminal_user.txt
Jan 13 14:07:12 qesapworker-prg5 worker[73422]: [debug] [pid:73422] Uploading artefact serial_terminal_user.txt
Jan 13 14:07:12 qesapworker-prg5 worker[15237]: [debug] [pid:15237] Job 16427146 stopped as incomplete
Jan 13 14:07:12 qesapworker-prg5 worker[15237]: [debug] [pid:15237] Unable to read result-check_network.json: Can't open file "/var/lib/openqa/pool/23/testresults/result-check_network.json": No such file or directory at />
Jan 13 14:07:12 qesapworker-prg5 worker[15237]:         Mojo::File::slurp(Mojo::File=SCALAR(0x562378437cd8)) called at /usr/share/openqa/script/../lib/OpenQA/Worker/Job.pm line 1135
Jan 13 14:07:12 qesapworker-prg5 worker[15237]:         eval {...} called at /usr/share/openqa/script/../lib/OpenQA/Worker/Job.pm line 1135
Jan 13 14:07:12 qesapworker-prg5 worker[15237]:         OpenQA::Worker::Job::_read_json_file(OpenQA::Worker::Job=HASH(0x562379180a18), "result-check_network.json") called at /usr/share/openqa/script/../lib/OpenQA/Worker/J>
Jan 13 14:07:12 qesapworker-prg5 worker[15237]:         OpenQA::Worker::Job::_read_module_result(OpenQA::Worker::Job=HASH(0x562379180a18), "check_network") called at /usr/share/openqa/script/../lib/OpenQA/Worker/Job.pm li>
Jan 13 14:07:12 qesapworker-prg5 worker[15237]:         OpenQA::Worker::Job::_read_result_file(OpenQA::Worker::Job=HASH(0x562379180a18), "", ARRAY(0x5623791c2740)) called at /usr/share/openqa/script/../lib/OpenQA/Worker/J>
Jan 13 14:07:12 qesapworker-prg5 worker[15237]:         OpenQA::Worker::Job::_upload_results_step_0_prepare(OpenQA::Worker::Job=HASH(0x562379180a18), CODE(0x5623783eb540)) called at /usr/share/openqa/script/../lib/OpenQA/>
Jan 13 14:07:12 qesapworker-prg5 worker[15237]:         OpenQA::Worker::Job::_upload_results(OpenQA::Worker::Job=HASH(0x562379180a18), CODE(0x5623783eb540)) called at /usr/share/openqa/script/../lib/OpenQA/Worker/Job.pm l>
Jan 13 14:07:12 qesapworker-prg5 worker[15237]:         OpenQA::Worker::Job::_stop_step_5_2_upload(OpenQA::Worker::Job=HASH(0x562379180a18), "died", CODE(0x56237930eda8)) called at /usr/share/openqa/script/../lib/OpenQA/W>
Jan 13 14:07:12 qesapworker-prg5 worker[15237]:         OpenQA::Worker::Job::__ANON__() called at /usr/share/openqa/script/../lib/OpenQA/Worker/Job.pm line 120
Jan 13 14:07:12 qesapworker-prg5 worker[15237]:         OpenQA::Worker::Job::_invoke_after_result_upload(OpenQA::Worker::Job=HASH(0x562379180a18), CODE(0x562379312558)) called at /usr/share/openqa/script/../lib/OpenQA/Wor>
Jan 13 14:07:12 qesapworker-prg5 worker[15237]:         OpenQA::Worker::Job::_stop_step_5_1_upload(OpenQA::Worker::Job=HASH(0x562379180a18), "died", CODE(0x56237930eda8)) called at /usr/share/openqa/script/../lib/OpenQA/W>
Jan 13 14:07:12 qesapworker-prg5 worker[15237]:         OpenQA::Worker::Job::__ANON__(Mojo::IOLoop::Subprocess=HASH(0x56237930b258), "", "died") called at /usr/lib/perl5/vendor_perl/5.26.1/Mojo/IOLoop/Subprocess.pm line 85
Jan 13 14:07:12 qesapworker-prg5 worker[15237]:         Mojo::IOLoop::Subprocess::__ANON__(Mojo::IOLoop::Stream=HASH(0x56237930f270)) called at /usr/lib/perl5/vendor_perl/5.26.1/Mojo/EventEmitter.pm line 15
Jan 13 14:07:12 qesapworker-prg5 worker[15237]:         Mojo::EventEmitter::emit(Mojo::IOLoop::Stream=HASH(0x56237930f270), "close") called at /usr/lib/perl5/vendor_perl/5.26.1/Mojo/IOLoop/Stream.pm line 27
Jan 13 14:07:12 qesapworker-prg5 worker[15237]:         Mojo::IOLoop::Stream::close(Mojo::IOLoop::Stream=HASH(0x56237930f270)) called at /usr/lib/perl5/vendor_perl/5.26.1/Mojo/IOLoop/Stream.pm line 109
Jan 13 14:07:12 qesapworker-prg5 worker[15237]:         Mojo::IOLoop::Stream::_read(Mojo::IOLoop::Stream=HASH(0x56237930f270)) called at /usr/lib/perl5/vendor_perl/5.26.1/Mojo/IOLoop/Stream.pm line 57
Jan 13 14:07:12 qesapworker-prg5 worker[15237]:         Mojo::IOLoop::Stream::__ANON__(Mojo::Reactor::Poll=HASH(0x56237222f690)) called at /usr/lib/perl5/vendor_perl/5.26.1/Mojo/Reactor/Poll.pm line 141
Jan 13 14:07:12 qesapworker-prg5 worker[15237]:         eval {...} called at /usr/lib/perl5/vendor_perl/5.26.1/Mojo/Reactor/Poll.pm line 141
Jan 13 14:07:12 qesapworker-prg5 worker[15237]:         Mojo::Reactor::Poll::_try(Mojo::Reactor::Poll=HASH(0x56237222f690), "I/O watcher", CODE(0x56237930d178), 0) called at /usr/lib/perl5/vendor_perl/5.26.1/Mojo/Reactor/>
Jan 13 14:07:12 qesapworker-prg5 worker[15237]:         Mojo::Reactor::Poll::one_tick(Mojo::Reactor::Poll=HASH(0x56237222f690)) called at /usr/lib/perl5/vendor_perl/5.26.1/Mojo/Reactor/Poll.pm line 101
Jan 13 14:07:12 qesapworker-prg5 worker[15237]:         Mojo::Reactor::Poll::start(Mojo::Reactor::Poll=HASH(0x56237222f690)) called at /usr/lib/perl5/vendor_perl/5.26.1/Mojo/IOLoop.pm line 134
Jan 13 14:07:12 qesapworker-prg5 worker[15237]:         Mojo::IOLoop::start(Mojo::IOLoop=HASH(0x562373b66ac8)) called at /usr/share/openqa/script/../lib/OpenQA/Worker.pm line 369
Jan 13 14:07:12 qesapworker-prg5 worker[15237]:         OpenQA::Worker::exec(OpenQA::Worker=HASH(0x562375602398)) called at /usr/share/openqa/script/worker line 120
Jan 13 14:07:12 qesapworker-prg5 worker[15237]: [debug] [pid:15237] REST-API call: POST "http://openqa.suse.de/api/v1/jobs/16427146/status"
Jan 13 14:07:12 qesapworker-prg5 worker[73425]: [debug] [pid:73425] Optimizing /var/lib/openqa/pool/23/testresults/glibc_locale-8.png
Jan 13 14:07:12 qesapworker-prg5 worker[73425]: [debug] [pid:73425] Uploading artefact glibc_locale-8.png as c62e2548b2e63f5b68ffe0f91e6bc96b
Jan 13 14:07:12 qesapworker-prg5 worker[73425]: [debug] [pid:73425] Optimizing /var/lib/openqa/pool/23/testresults/.thumbs/glibc_locale-8.png
Jan 13 14:07:12 qesapworker-prg5 worker[73425]: [debug] [pid:73425] Uploading artefact glibc_locale-8.png as c62e2548b2e63f5b68ffe0f91e6bc96b
Jan 13 14:07:12 qesapworker-prg5 worker[73425]: [debug] [pid:73425] Optimizing /var/lib/openqa/pool/23/testresults/glibc_locale-1.png
Jan 13 14:07:12 qesapworker-prg5 worker[73425]: [debug] [pid:73425] Uploading artefact glibc_locale-1.png as d804d6cb366b2158b5210b343401c366
Jan 13 14:07:12 qesapworker-prg5 worker[73425]: [debug] [pid:73425] Optimizing /var/lib/openqa/pool/23/testresults/.thumbs/glibc_locale-1.png
Jan 13 14:07:12 qesapworker-prg5 worker[73425]: [debug] [pid:73425] Uploading artefact glibc_locale-1.png as d804d6cb366b2158b5210b343401c366
…
Jan 13 14:07:13 qesapworker-prg5 worker[73425]: [debug] [pid:73425] Optimizing /var/lib/openqa/pool/23/testresults/.thumbs/glibc_locale-3.png
Jan 13 14:07:13 qesapworker-prg5 worker[73425]: [debug] [pid:73425] Uploading artefact glibc_locale-3.png as 9fb8d6a2417669f089c1ce447b73e1ff
Jan 13 14:07:13 qesapworker-prg5 worker[73425]: [debug] [pid:73425] Uploading artefact glibc_locale-9.txt
Jan 13 14:07:13 qesapworker-prg5 worker[73425]: [debug] [pid:73425] Uploading artefact glibc_locale-21.txt
Jan 13 14:07:13 qesapworker-prg5 worker[73425]: [debug] [pid:73425] Uploading artefact glibc_locale-20.txt
Jan 13 14:07:13 qesapworker-prg5 worker[73425]: [debug] [pid:73425] Uploading artefact glibc_locale-25.txt
Jan 13 14:07:13 qesapworker-prg5 worker[73425]: [debug] [pid:73425] Uploading artefact glibc_locale-24.txt
Jan 13 14:07:13 qesapworker-prg5 worker[73425]: [debug] [pid:73425] Uploading artefact glibc_locale-23.txt
Jan 13 14:07:13 qesapworker-prg5 worker[73425]: [debug] [pid:73425] Uploading artefact glibc_locale-7.txt
Jan 13 14:07:13 qesapworker-prg5 worker[73425]: [debug] [pid:73425] Uploading artefact glibc_locale-27.txt
Jan 13 14:07:13 qesapworker-prg5 worker[73425]: [debug] [pid:73425] Uploading artefact glibc_locale-18.txt
Jan 13 14:07:13 qesapworker-prg5 worker[73425]: [debug] [pid:73425] Uploading artefact glibc_locale-16.txt
Jan 13 14:07:13 qesapworker-prg5 worker[73425]: [debug] [pid:73425] Uploading artefact glibc_locale-19.txt
Jan 13 14:07:13 qesapworker-prg5 worker[73425]: [debug] [pid:73425] Uploading artefact glibc_locale-22.txt
Jan 13 14:07:13 qesapworker-prg5 worker[15237]: [debug] [pid:15237] Upload concluded (at glibc_locale)
Jan 13 14:07:13 qesapworker-prg5 worker[15237]: [debug] [pid:15237] REST-API call: POST "http://openqa.suse.de/api/v1/jobs/16427146/set_done?reason=died%3A+terminated+prematurely%2C+see+log+output+for+details&result=incom>
Jan 13 14:07:13 qesapworker-prg5 worker[15237]: [debug] [pid:15237] Job 16427146 from openqa.suse.de finished - reason: died
Jan 13 14:07:13 qesapworker-prg5 worker[15237]: [debug] [pid:15237] Cleaning up for next job
Jan 13 14:08:59 qesapworker-prg5 worker[15237]: [debug] [pid:15237] Accepting job 16427266 from openqa.suse.de.
Jan 13 14:08:59 qesapworker-prg5 worker[15237]: [debug] [pid:15237] Setting job 16427266 from openqa.suse.de up
Jan 13 14:08:59 qesapworker-prg5 worker[15237]: [debug] [pid:15237] Preparing Mojo::IOLoop::ReadWriteProcess::Session
Jan 13 14:08:59 qesapworker-prg5 worker[15237]: [info] [pid:15237] +++ setup notes +++
Jan 13 14:08:59 qesapworker-prg5 worker[15237]: [info] [pid:15237] Running on qesapworker-prg5:23 (Linux 6.4.0-150600.23.30-default #1 SMP PREEMPT_DYNAMIC Sat Dec  7 08:37:53 UTC 2024 (8c25a0a) x86_64)
Jan 13 14:08:59 qesapworker-prg5 worker[15237]: [debug] [pid:15237] Job settings:
…

Relevant os-autoinst logs:

[2025-01-13T14:07:03.162048Z] [debug] [pid:69720] no change: 86.6s
[2025-01-13T14:07:03.510529Z] [debug] [pid:69720] no match: 86.6s, best candidate: grub2-20241114 (0.18)
[2025-01-13T14:07:07.026919Z] [warn] [pid:69720] !!! backend::baseclass::check_asserted_screen: check_asserted_screen took 2.86 seconds for 54 candidate needles - make your needles more specific
[2025-01-13T14:07:07.027016Z] [debug] [pid:69720] no match: 85.6s, best candidate: grub2-20241114 (0.18)
[2025-01-13T14:07:07.028906Z] [debug] [pid:69720] considering VNC stalled, no update for 4.52 seconds
[2025-01-13T14:07:07.038750Z] [debug] [pid:69720] no change: 82.7s
[2025-01-13T14:07:07.378976Z] [debug] [pid:69720] no match: 82.7s, best candidate: grub2-20241114 (0.18)
[2025-01-13T14:07:07.379278Z] [debug] [pid:69720] pointer type 1 0 1024 768 -257
[2025-01-13T14:07:07.379322Z] [debug] [pid:69720] led state 0 1 1 -261
corrupted size vs. prev_size
Error while decoding stream #0:0: Invalid data found when processing input
[2025-01-13T14:07:11.182059Z] [debug] [pid:69511] backend process exited: 0
[2025-01-13T14:07:11.183220Z] [debug] [pid:69511] isotovideo handle_shutdown
[2025-01-13T14:07:11.183480Z] [debug] [pid:69511] stopping command server 69521 because test execution ended
[2025-01-13T14:07:11.183704Z] [debug] [pid:69511] isotovideo: informing websocket clients before stopping command server: http://127.0.0.1:20233/wxFqLKHC9jtwtMlN/broadcast
[2025-01-13T14:07:11.184467Z] [info] [pid:69511] ::: backend::driver::_collect_orphan: Driver backend collected unknown process with pid 69913 and exit status: 0
[2025-01-13T14:07:11.205481Z] [debug] [pid:69511] commands process exited: 0
[2025-01-13T14:07:11.305912Z] [debug] [pid:69511] done with command server
[2025-01-13T14:07:11.306221Z] [debug] [pid:69511] stopping autotest process 69530
[2025-01-13T14:07:11.306715Z] [debug] [pid:69530] autotest received signal TERM, saving results of current test before exiting
[2025-01-13T14:07:11.325243Z] [debug] [pid:69511] [autotest] process exited: 1
[2025-01-13T14:07:11.425890Z] [debug] [pid:69511] done with autotest process
[2025-01-13T14:07:11.426033Z] [debug] [pid:69511] isotovideo failed
[2025-01-13T14:07:11.428938Z] [debug] [pid:69511] stopping backend process 69720
[2025-01-13T14:07:11.429133Z] [debug] [pid:69511] done with backend process
69511: EXIT 1
[2025-01-13T14:07:11.521543Z] [info] [pid:15237] Isotovideo exit status: 1
[2025-01-13T14:07:11.546777Z] [info] [pid:15237] +++ worker notes +++
[2025-01-13T14:07:11.547021Z] [info] [pid:15237] End time: 2025-01-13 14:07:11
[2025-01-13T14:07:11.547142Z] [info] [pid:15237] Result: died
[2025-01-13T14:07:11.554374Z] [info] [pid:73422] Uploading image_info-basic_health_check.txt
[2025-01-13T14:07:11.581017Z] [info] [pid:73422] Uploading image_info-dmesg.txt
[2025-01-13T14:07:11.601762Z] [info] [pid:73422] Uploading image_info-ip-addr-show.txt
[2025-01-13T14:07:11.615916Z] [info] [pid:73422] Uploading image_info-journal.txt
[2025-01-13T14:07:11.681871Z] [info] [pid:73422] Uploading image_info-loadavg.txt
[2025-01-13T14:07:11.702895Z] [info] [pid:73422] Uploading image_info-problem_detection_logs.tar.xz
frame= 1353 fps=1.4 q=0.0 Lsize=    2068kB time=00:00:56.33 bitrate= 300.7kbits/s speed=0.059x    
video:2058kB audio:0kB subtitle:0kB other streams:0kB global headers:0kB muxing overhead: 0.462934%
[2025-01-13T14:07:11.726346Z] [info] [pid:73422] Uploading image_info-psaxf.txt
[2025-01-13T14:07:11.743446Z] [info] [pid:73422] Uploading image_info-resolv-conf.txt
[2025-01-13T14:07:11.760713Z] [info] [pid:73422] Uploading image_info-sysconfig.tar.gz
[2025-01-13T14:07:11.780262Z] [info] [pid:73422] Uploading image_info-systemctl.txt
[2025-01-13T14:07:11.805747Z] [info] [pid:73422] Uploading image_info-systemctl_status.txt
[2025-01-13T14:07:11.830682Z] [info] [pid:73422] Uploading image_info-systemctl_unit-files.txt
[2025-01-13T14:07:11.850402Z] [info] [pid:73422] Uploading journal_check-audit.tar
[2025-01-13T14:07:11.870232Z] [info] [pid:73422] Uploading journal_check-full_journal.log
[2025-01-13T14:07:11.933653Z] [info] [pid:73422] Uploading pam-results.tap
[2025-01-13T14:07:11.961788Z] [info] [pid:73422] Uploading video.webm
[2025-01-13T14:07:12.037806Z] [info] [pid:73422] Uploading autoinst-log.txt

Actions

Copy link

#24

Updated by mkittler 4 months ago

This needs to be fixed in RWP (unless we want to inherit from its package and reimplement is_running and stop on our side): https://github.com/openSUSE/Mojo-IOLoop-ReadWriteProcess/pull/61

I also came up with a small improvement for openQA: https://github.com/os-autoinst/openQA/pull/6110

Actions

Copy link

#25

Updated by mkittler 4 months ago · Edited

Status changed from In Progress to Feedback

Both PRs have been merged.

I also created https://build.opensuse.org/package/show/devel:openQA/perl-Mojo-IOLoop-ReadWriteProcess and deployed it on OSD workers via sudo salt -C 'G@roles:worker' cmd.run 'zypper in --allow-vendor-change perl-Mojo-IOLoop-ReadWriteProcess'. Judging by the output of sudo salt -C 'G@roles:worker' cmd.run "grep -- '-\$pid' /usr/lib/perl5/vendor_perl/5.26.1/Mojo/IOLoop/ReadWriteProcess.pm" the new code is in place.

Note that the package hasn't been published for aarch64 yet so if we see it on that architecture that wouldn't necessarily mean that the fix didn't work.

So let's see whether this fixes the issue.

Actions

Copy link

#26

Updated by mkittler 4 months ago

We had to revert changes due to a race condition.

This change will fix it: https://github.com/openSUSE/Mojo-IOLoop-ReadWriteProcess/pull/62

I'll continue deploying it later but only on a single worker.

I'll also try to improve tests in Mojo-RWP.

Actions

Copy link

#27

Updated by okurz 4 months ago

Related to action #175464: jobs incomplete with auto_review:"setup failure: isotovideo can not be started" added

Actions

Copy link

#28

Updated by mkittler 4 months ago · Edited

This is also a bit problematic when it comes to stopping the process. That means if we call stop before setpgrp(0, 0) runs (or setpgrp(0, 0) does not run after all for some reason) the process cannot be stopped. We always had this problem (also before my recent changes) so supposedly it doesn't really matter in practice. However, here's a daft for how fixing this: https://github.com/openSUSE/Mojo-IOLoop-ReadWriteProcess/pull/64

Actions

Copy link

#29

Updated by okurz 4 months ago

Parent task set to #175515

Actions

Copy link

#30

Updated by mkittler 4 months ago

A version of RWP I will deploy tomorrow on a single production test host to see whether my latest changed work in production: https://build.opensuse.org/package/show/home:mkittler:branches:devel:languages:perl/perl-Mojo-IOLoop-ReadWriteProcess

Actions

Copy link

#31

Updated by mkittler 4 months ago

As mentioned on Slack:

I now deployed a fixed version of the change on openqaworker14.qa.suse.cz because that was one of the previously affected machines. If you see any issues, simply disable the repository home_mkittler_test_poo170209 I added there and install the regular version of RWP again via zypper in --allow-vendor-change perl-Mojo-IOLoop-ReadWriteProcess. (Of course I'll handle that, just in case I'm not around.)

Actions

Copy link

#32

Updated by mkittler 4 months ago

So far the situation looks good on openqaworker14:

openqa=> select id, t_finished, result, reason from jobs where (select host from workers where id = assigned_worker_id) = 'openqaworker14' and result = 'incomplete' and t_finished >= '2025-01-16T12:00:00' order by t_finished;
    id    |     t_finished      |   result   |                           reason                            
----------+---------------------+------------+-------------------------------------------------------------
 16479164 | 2025-01-16 14:01:27 | incomplete | backend died: QEMU exited unexpectedly, see log for details
 16479166 | 2025-01-16 14:02:29 | incomplete | backend died: QEMU exited unexpectedly, see log for details
 16479171 | 2025-01-16 14:03:26 | incomplete | backend died: QEMU exited unexpectedly, see log for details
 16479176 | 2025-01-16 14:04:46 | incomplete | backend died: QEMU exited unexpectedly, see log for details
 16479177 | 2025-01-16 14:05:50 | incomplete | backend died: QEMU exited unexpectedly, see log for details
 16479178 | 2025-01-16 14:06:53 | incomplete | backend died: QEMU exited unexpectedly, see log for details
 16479179 | 2025-01-16 14:07:57 | incomplete | backend died: QEMU exited unexpectedly, see log for details
(7 rows)

These incompletes were caused by a leftover process on slot 14 which I have cleaned up only now. The leftover was from https://openqa.suse.de/tests/16472470 which ran 12 hours ago which was before I deployed my changes.

That means there were so far no relevant leftover processes and also no regressions. (I also checked failed jobs but also couldn't find any relevant jobs.)

Actions

Copy link

#33

Updated by rfan1 4 months ago

I can hit the issue again, https://openqa.suse.de/tests/16486907

Can you please help double check?

Actions

Copy link

#34

Updated by mkittler 4 months ago

This slot looks fine. I assume you have already killed the leftover process yourself then.

I ran the query from yesterday again to check on openqaworker14 and it looks good. So I'm going to deploy my change everywhere again.

Actions

Copy link

#35

Updated by mkittler 4 months ago

Considering there are other problems with OSD right now we decided to deploy the fix everywhere only next week.

Until then, here's how you can workaround it:

Check whether there's a leftover process:

martchus@qesapworker-prg6:~> sudo systemctl status openqa-worker-auto-restart@8
● openqa-worker-auto-restart@8.service - openQA Worker #8
     Loaded: loaded (/usr/lib/systemd/system/openqa-worker-auto-restart@.service; enabled; preset: disabled)
    Drop-In: /etc/systemd/system/openqa-worker-auto-restart@.service.d
             └─20-nvme-autoformat.conf, 30-openqa-max-inactive-caching-downloads.conf
     Active: active (running) since Sun 2025-01-12 05:18:28 UTC; 1 day 9h ago
    Process: 21274 ExecStartPre=/usr/bin/install -d -m 0755 -o _openqa-worker /var/lib/openqa/pool/8 (code=exited, status=0/SUCCESS)
   Main PID: 21280 (worker)
      Tasks: 9 (limit: 19660)
        CPU: 1h 45min 11.987s
     CGroup: /openqa.slice/openqa-worker.slice/openqa-worker-auto-restart@8.service
             ├─21280 /usr/bin/perl /usr/share/openqa/script/worker --instance 8
             ├─23645 /usr/bin/qemu-system-x86_64 -device VGA,edid=on,xres=1024,yres=768 -only-migratable -chardev ringbuf,id=serial0,logfile=serial0,logappend=on -serial chardev:serial0 -audiodev none,id=snd0 -device inte>
             └─75353 swtpm socket --tpmstate dir=/tmp/mytpm8 --ctrl type=unixio,path=/tmp/mytpm8/swtpm-sock --log level=20 -d

Jan 13 15:16:50 qesapworker-prg6 worker[21280]:         eval {...} called at /usr/lib/perl5/vendor_perl/5.26.1/Mojo/Reactor/Poll.pm line 141
Jan 13 15:16:50 qesapworker-prg6 worker[21280]:         Mojo::Reactor::Poll::_try(Mojo::Reactor::Poll=HASH(0x555de2562340), "I/O watcher", CODE(0x555de6399db8), 0) called at /usr/lib/perl5/vendor_perl/5.26.1/Mojo/Reactor/>
Jan 13 15:16:50 qesapworker-prg6 worker[21280]:         Mojo::Reactor::Poll::one_tick(Mojo::Reactor::Poll=HASH(0x555de2562340)) called at /usr/lib/perl5/vendor_perl/5.26.1/Mojo/Reactor/Poll.pm line 101
Jan 13 15:16:50 qesapworker-prg6 worker[21280]:         Mojo::Reactor::Poll::start(Mojo::Reactor::Poll=HASH(0x555de2562340)) called at /usr/lib/perl5/vendor_perl/5.26.1/Mojo/IOLoop.pm line 134
Jan 13 15:16:50 qesapworker-prg6 worker[21280]:         Mojo::IOLoop::start(Mojo::IOLoop=HASH(0x555de3e99fe8)) called at /usr/share/openqa/script/../lib/OpenQA/Worker.pm line 369
Jan 13 15:16:50 qesapworker-prg6 worker[21280]:         OpenQA::Worker::exec(OpenQA::Worker=HASH(0x555de5a1a798)) called at /usr/share/openqa/script/worker line 120
Jan 13 15:16:50 qesapworker-prg6 worker[21280]: [debug] [pid:21280] Upload concluded (no current module)
Jan 13 15:16:50 qesapworker-prg6 worker[21280]: [debug] [pid:21280] REST-API call: POST "http://openqa.suse.de/api/v1/jobs/16436185/set_done?reason=backend+died%3A+QEMU+exited+unexpectedly%2C+see+log+for+details&result=in>
Jan 13 15:16:50 qesapworker-prg6 worker[21280]: [debug] [pid:21280] Job 16436185 from openqa.suse.de finished - reason: died
Jan 13 15:16:50 qesapworker-prg6 worker[21280]: [debug] [pid:21280] Cleaning up for next job

Terminate it:

martchus@qesapworker-prg6:~> sudo kill 23645 75353

Run kill again to see if it worked:

martchus@qesapworker-prg6:~> sudo kill 23645 75353
kill: sending signal to 23645 failed: No such process
kill: sending signal to 75353 failed: No such process

Actions

Copy link

#36

Updated by mkittler 4 months ago

Everything looks still good on openqaworker14 so I'm going to bring in my changes again on all workers.

So I ran osc copypac home:mkittler:branches:devel:languages:perl perl-Mojo-IOLoop-ReadWriteProcess devel:openQA to copy the link I have on my home project to devel:openQA and will do an update of all OSD workers as soon as the build has finished.

Actions

Copy link

#37

Updated by okurz 4 months ago

Due date changed from 2025-01-21 to 2025-01-24

Given that we decided to explicitly delay the deployment we should give some more days for monitoring. Updating due-date accordingly.

Actions

Copy link

#38

Updated by mkittler 4 months ago · Edited

Considering https://openqa.suse.de/tests/16517840 this might not fix all cases or only one part of the problem:

The first occurrence of the issue on that slot was https://openqa.suse.de/tests/16516784 and before ran https://openqa.suse.de/tests/16516249 which probably caused the leftover process. So all jobs ran when the fix was already in place.
Looking at the worker slot right now I cannot see a leftover process, though.
- So maybe there was another reason why the port was blocked (and not a leftover process)? That's not very likely.
- Or the worker sends SIGTERM/SIGKILL correctly now (hence no leftover process stayed as it used to be the case) but it was still around for a while (maybe until an uninterruptible system call finally timed out).
  - The last job that ran on the slot was also still affected (https://openqa.suse.de/tests/16518078). So the leftover process must have nevertheless been around for over an hour.
    - EDIT: New jobs have been scheduled on the slot since then and passed as expected (since no leftover process was around anymore).

Involved jobs:

openqa=> select id, state, result, t_created, t_started, t_finished, assigned_worker_id from jobs where id in ( 16517840, 16516784, 16518078 ) order by t_started;
    id    | state |   result   |      t_created      |      t_started      |     t_finished      | assigned_worker_id 
----------+-------+------------+---------------------+---------------------+---------------------+--------------------
 16516249 | done  | incomplete | 2025-01-20 21:47:47 | 2025-01-21 00:51:11 | 2025-01-21 01:10:54 |               2597  <- worker31:31
 16516784 | done  | incomplete | 2025-01-21 00:19:31 | 2025-01-21 01:12:02 | 2025-01-21 01:16:31 |               2597
 16517840 | done  | incomplete | 2025-01-21 02:13:19 | 2025-01-21 02:13:25 | 2025-01-21 02:13:59 |               2597
 16518078 | done  | incomplete | 2025-01-21 03:27:10 | 2025-01-21 03:27:16 | 2025-01-21 03:28:13 |               2597

Actions

Copy link

#39

Updated by mkittler 4 months ago

It looks like two worker slots were affected by the problem after the deployment of the fix:

openqa=> select count(id), min(state) as state, min(result) as result, min(reason) as reason, min(t_finished) as t_finished_min, max(t_finished) as t_finished_max, min((select concat(host, ':', instance) from workers where workers.id = jobs.assigned_worker_id)) as worker_slot from jobs where t_started >= '2025-01-20 14:00' and (select count(comments.id) from comments where comments.job_id = jobs.id and text ilike '%poo#170209%') > 0 group by jobs.assigned_worker_id order by count(jobs.id) desc limit 100;
 count | state |   result   |                           reason                            |   t_finished_min    |   t_finished_max    |     worker_slot     
-------+-------+------------+-------------------------------------------------------------+---------------------+---------------------+---------------------
   328 | done  | incomplete | backend died: QEMU exited unexpectedly, see log for details | 2025-01-20 16:28:43 | 2025-01-21 04:33:36 | qesapworker-prg7:19
    51 | done  | incomplete | backend died: QEMU exited unexpectedly, see log for details | 2025-01-21 01:16:31 | 2025-01-21 03:28:13 | worker31:31

When my change was not effective the problem was also only happening on two slots:

openqa=> select count(id), min(state) as state, min(result) as result, min(reason) as reason, min(t_finished) as t_finished_min, max(t_finished) as t_finished_max, min((select concat(host, ':', instance) from workers where workers.id = jobs.assigned_worker_id)) as worker_slot from jobs where t_started >= '2025-01-18 00:00' and t_started < '2025-01-20' and (select count(comments.id) from comments where comments.job_id = jobs.id and text ilike '%poo#170209%') > 0 group by jobs.assigned_worker_id order by count(jobs.id) desc limit 100;
 count | state |   result   |                           reason                            |   t_finished_min    |   t_finished_max    |     worker_slot     
-------+-------+------------+-------------------------------------------------------------+---------------------+---------------------+---------------------
    57 | done  | incomplete | backend died: QEMU exited unexpectedly, see log for details | 2025-01-19 20:16:33 | 2025-01-19 23:19:17 | qesapworker-prg7:19
    35 | done  | incomplete | backend died: QEMU exited unexpectedly, see log for details | 2025-01-18 00:10:37 | 2025-01-19 00:40:39 | qesapworker-prg6:20

Slot worker31:31 is the one mentioned in the previous note. Slot qesapworker-prg7:19 still had a leftover process from before the deployment of my changes so it can be ignored.

This leaves finding out what happened on worker31:31.

Actions

Copy link

#40

Updated by mkittler 4 months ago · Edited

I ran the first query from the previous note again. The results are identical. So there are no new occurrences. The change probably helps although I still don't know what happened on worker31:31.

I verified the install date of RWP on worker31 and it was definitely installed before the problematic test was started (2025-01-21 00:51:11):

martchus@worker31:~> rpm -qi perl-Mojo-IOLoop-ReadWriteProcess 
Name        : perl-Mojo-IOLoop-ReadWriteProcess
Version     : 0.340.0
Release     : lp156.43.1
Architecture: noarch
Install Date: Mon 20 Jan 2025 11:00:39 AM UTC

/usr/lib/perl5/vendor_perl/5.26.1/Mojo/IOLoop/ReadWriteProcess.pm also contains the new/fixed code.

The job that probably caused the leftover process terminated like this by the way:

[2025-01-21T01:10:38.743738Z] [warn] [pid:59698] !!! backend::baseclass::check_asserted_screen: check_asserted_screen took 2.49 seconds for 55 candidate needles - make your needles more specific
[2025-01-21T01:10:38.743844Z] [debug] [pid:59698] no match: 0.8s, best candidate: bootmenu-SLES-generic-15SP6-20230809 (0.29)
[2025-01-21T01:10:38.745097Z] [debug] [pid:59698] considering VNC stalled, no update for 4.24 seconds
[2025-01-21T01:10:41.184238Z] [warn] [pid:59698] !!! backend::baseclass::check_asserted_screen: check_asserted_screen took 2.44 seconds for 55 candidate needles - make your needles more specific
[2025-01-21T01:10:41.184331Z] [debug] [pid:59698] no match: -1.7s, best candidate: bootmenu-SLES-generic-15SP6-20230809 (0.29)
[2025-01-21T01:10:44.597147Z] [debug] [pid:59480] backend process exited: 0
[2025-01-21T01:10:44.597482Z] [debug] [pid:59480] isotovideo handle_shutdown
[2025-01-21T01:10:44.597572Z] [debug] [pid:59480] stopping command server 59493 because test execution ended
[2025-01-21T01:10:44.597641Z] [debug] [pid:59480] isotovideo: informing websocket clients before stopping command server: http://127.0.0.1:20313/GMMM4XezHMH0KycZ/broadcast
[2025-01-21T01:10:44.599935Z] [info] [pid:59480] ::: backend::driver::_collect_orphan: Driver backend collected unknown process with pid 59888 and exit status: 0
[2025-01-21T01:10:44.613112Z] [debug] [pid:59480] commands process exited: 0
[2025-01-21T01:10:44.713381Z] [debug] [pid:59480] done with command server
[2025-01-21T01:10:44.713487Z] [debug] [pid:59480] stopping autotest process 59502
[2025-01-21T01:10:44.713673Z] [debug] [pid:59502] autotest received signal TERM, saving results of current test before exiting
[2025-01-21T01:10:44.731586Z] [debug] [pid:59480] [autotest] process exited: 1
[2025-01-21T01:10:44.831837Z] [debug] [pid:59480] done with autotest process
[2025-01-21T01:10:44.831910Z] [debug] [pid:59480] isotovideo failed
[2025-01-21T01:10:44.832682Z] [debug] [pid:59480] stopping backend process 59698
[2025-01-21T01:10:44.832754Z] [debug] [pid:59480] done with backend process
59480: EXIT 1
[2025-01-21T01:10:44.897741Z] [info] [pid:11182] Isotovideo exit status: 1

So probably some backend process terminated unexpectedly. This is of course a problem of its own and I'm not sure what to do about it based on the information I could gather.

Actions

Copy link

#41

Updated by mkittler 4 months ago

The query still gives the same results. I will resolve this issue tomorrow when there were no further occurrences until then. (4 days without new occurrences must be good enough.)

I would not wait with resolving this issue until the new RWP version is released because it would make sense to improve the RWP unit tests before doing that - especially as we will create a ticket about it after the 5 whys analysis.

Actions

Copy link

#42

Updated by mkittler 4 months ago

Status changed from Feedback to Resolved

Still no further occurrences so I'm resolving this ticket as mentioned in the previous comment.

Actions

Copy link

#43

Updated by okurz 4 months ago

Due date deleted (~~2025-01-24~~)

Actions

Copy link

#44

Updated by tinita about 2 months ago

It happened again: https://gitlab.suse.de/openqa/scripts-ci/-/jobs/4064701
worker36 seemed to have that problem for a couple of hours: https://openqa.suse.de/admin/workers/2786

Actions

Copy link

#45

Updated by tinita about 2 months ago

Status changed from Resolved to Workable

Actions

Copy link

#46

Updated by mkittler about 1 month ago

The first affected job on that slot was https://openqa.suse.de/tests/17212785. The preceding job was https://openqa.suse.de/tests/17212307 and probably caused the leftover QEMU process.

Actions

Copy link

#47

Updated by mkittler about 1 month ago · Edited

Looks like this should have been fixed by https://github.com/openSUSE/Mojo-IOLoop-ReadWriteProcess/commit/ba7bb383a02c44a3d6340a900fbd8d179942c449 (and the fixup https://github.com/openSUSE/Mojo-IOLoop-ReadWriteProcess/commit/c4c79303145fce0880b9f1697a782840085d3c16).

This change is also deployed on worker36. Maybe we hit the rare case where QEMU really cannot be killed/terminated.

Of course it is also still bad that the worker didn't declare itself as broken and we've got incomplete jobs instead.

EDIT: I started a QEMU-based test locally to test whether our check for whether QEMU is running works for real. So I ran that check like this:

my $w = OpenQA::Worker->new({instance => 1});
$w->pool_directory('/hdd/openqa-devel/openqa/pool/1');
print("qemu running: " . $w->is_qemu_running . "\n");

It successfully printed the PID of the currently running QEMU process. The PID file is also exempt from cleanup if QEMU is still running. So I still have no idea why the check isn't working sometimes in production.

Actions

Copy link

#48

Updated by mkittler about 1 month ago

Subject changed from [sporadic] auto_review:"Failed to find an available port: Address already in use":retry, produces incomplete jobs on OSD, multiple machines size:M to [sporadic] auto_review:"Failed to find an available port: Address already in use":retry, produces incomplete jobs on OSD, multiple machines

Let's reestimate this ticket.

Actions

Copy link

#49

Updated by okurz about 1 month ago

Subject changed from [sporadic] auto_review:"Failed to find an available port: Address already in use":retry, produces incomplete jobs on OSD, multiple machines to [sporadic] auto_review:"Failed to find an available port: Address already in use":retry, produces incomplete jobs on OSD, multiple machines size:M

Sorry, we shouldn't reestimate tickets after so much effort was invested. Either ask for help to resolve this ticket or create a new ticket.

Actions

Copy link

#50

Updated by mkittler about 1 month ago

Precedes action #180110: [sporadic] auto_review:"Failed to find an available port: Address already in use":retry, produces incomplete jobs on OSD, multiple machines added

Actions

Copy link

#51

Updated by mkittler about 1 month ago

Status changed from Workable to Resolved

I created a follow-up ticket: #180110

Actions

Copy link

Also available in: Atom PDF

Project

General

Profile

QA (public) » openQA Project (public)

Tags

Custom queries

action #170209

[sporadic] auto_review:"Failed to find an available port: Address already in use":retry, produces incomplete jobs on OSD, multiple machines size:M

Observation¶

Acceptance criteria¶

Suggestion¶

Updated by tinita 6 months ago

Updated by okurz 6 months ago

Updated by tinita 6 months ago

Updated by tinita 6 months ago

Updated by jbaier_cz 6 months ago

Updated by tinita 6 months ago

Updated by mkittler 6 months ago

Updated by okurz 5 months ago

Updated by nicksinger 5 months ago

Updated by gpathak 5 months ago · Edited

Updated by tinita 5 months ago

Updated by nicksinger 5 months ago

Updated by nicksinger 5 months ago

Updated by okurz 5 months ago

Updated by gpuliti 5 months ago

Updated by gpuliti 5 months ago

Updated by gpuliti 5 months ago

Updated by livdywan 4 months ago

Updated by openqa_review 4 months ago

Updated by tinita 4 months ago · Edited

Updated by livdywan 4 months ago

Updated by mkittler 4 months ago

Updated by mkittler 4 months ago · Edited

Updated by mkittler 4 months ago

Updated by mkittler 4 months ago · Edited

Updated by mkittler 4 months ago

Updated by okurz 4 months ago

Updated by mkittler 4 months ago · Edited

Updated by okurz 4 months ago

Updated by mkittler 4 months ago

Updated by mkittler 4 months ago

Updated by mkittler 4 months ago

Updated by rfan1 4 months ago

Updated by mkittler 4 months ago

Updated by mkittler 4 months ago

Updated by mkittler 4 months ago

Updated by okurz 4 months ago

Updated by mkittler 4 months ago · Edited

Updated by mkittler 4 months ago

Updated by mkittler 4 months ago · Edited

Updated by mkittler 4 months ago

Updated by mkittler 4 months ago

Updated by okurz 4 months ago

Updated by tinita about 2 months ago

Updated by tinita about 2 months ago

Updated by mkittler about 1 month ago

Updated by mkittler about 1 month ago · Edited

Updated by mkittler about 1 month ago

Updated by okurz about 1 month ago

Updated by mkittler about 1 month ago

Updated by mkittler about 1 month ago