action #170209

open

coordination #102906: [saga][epic] Increased stability of tests with less "known failures", known incompletes handled automatically within openQA

coordination #175515: [epic] incomplete jobs with "Failed to find an available port: Address already in use"

[sporadic] auto_review:"Failed to find an available port: Address already in use":retry, produces incomplete jobs on OSD, multiple machines size:M

Added by nicksinger about 2 months ago. Updated 2 days ago.

Status: Feedback
Priority: High
Assignee:
Category: Regressions/Crashes
Target version:
Start date: 2024-11-25
Due date: 2025-01-21 (Due in 2 days)
% Done: 0%
Estimated time:

Description

Observation

I received an alert mail about an incomplete job: https://openqa.suse.de/tests/15996418
It fails with:

[2024-11-25T00:19:57.525877Z] [warn] [pid:103270] !!! : qemu-system-x86_64: -vnc :102,share=force-shared: Failed to find an available port: Address already in use

I asked in Slack and @tinita had been observing the same for "a few hours" already, apparently all on worker39.

https://openqa.suse.de/admin/workers/2898 shows that this started about 12 hours ago with this job: https://openqa.suse.de/tests/15995445
The qemu-system-x86_64 process 1963 has been running for 12 hours with -vnc :102,share=force-shared.

Acceptance criteria

  • AC1: Affected jobs are restarted automatically
  • AC2: We have a better understanding of situations where this can happen (if at all)

Suggestion

  • Check one more time for bugs – also consider testing (!) – in the code for handling leftover QEMU processes
  • Check one more time for bugs – also consider testing (!) – in terminating/killing the process group of isotovideo (in Mojo::…::ReadWriteProcess)
  • Add/enable debug logging when starting/stopping isotovideo (maybe on ReadWriteProcess level)
  • Consider starting/stopping isotovideo in a process group with low-level Perl code to replicate the error, investigate and potentially replace the problematic Mojo::ReadWriteProcess (see the sketch after this list)
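
For the last suggestion, a minimal low-level Perl sketch (nothing from the existing code base; the isotovideo invocation and the timeout are placeholders) of what starting and stopping a whole process group could look like:

use strict;
use warnings;
use POSIX ':sys_wait_h';

# Fork a child that becomes its own process group leader and then execs isotovideo.
my $pid = fork // die "fork failed: $!";
if ($pid == 0) {
    setpgrp(0, 0);                         # child: start a new process group
    exec 'isotovideo', @ARGV or die "exec failed: $!";
}

# ... later, to stop isotovideo including everything it spawned (e.g. qemu).
# Note the inherent race: if this runs before the child reached setpgrp(0, 0),
# the group does not exist yet and the group kill fails.
kill TERM => -$pid;                        # a negative PID addresses the whole group
my $waited = 0;
sleep 1 while waitpid($pid, WNOHANG) == 0 && $waited++ < 30;
kill KILL => -$pid if kill 0 => -$pid;     # escalate if any group member survived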

Related issues (2: 0 open, 2 closed)

Related to openQA Project (public) - action #166025: qemu process hanging indefinitely blocking the corresponding VNC port size:S (Resolved, mkittler, 2024-08-30)

Related to openQA Project (public) - action #175464: jobs incomplete with auto_review:"setup failure: isotovideo can not be started" (Resolved, okurz, 2025-01-15)
Actions #1

Updated by tinita about 2 months ago

  • Description updated (diff)
Actions #2

Updated by okurz about 2 months ago

  • Tags set to reactive work, osd
  • Category set to Regressions/Crashes
  • Priority changed from Normal to Urgent
  • Target version set to Ready
Actions #3

Updated by tinita about 2 months ago

  • Priority changed from Urgent to High
% ps aux | grep qemu
...
_openqa+   1963 98.4  0.4 7061392 2206548 ?     Sl   Nov24 750:16 /usr/bin/qemu-system-x86_64 -device VGA,edid=on,xres=1024,yres=768 -chardev ringbuf,id=serial0,logfile=serial0,logappend=on -serial chardev:serial0 -audiodev none,id=snd0 -device intel-hda -device hda-output,audiodev=snd0 -global isa-fdc.fdtypeA=none -m 2048 -cpu qemu64 -netdev user,id=qanet0,hostfwd=tcp::39102-:39102 -device virtio-net,netdev=qanet0,mac=52:54:00:12:34:56 -object rng-random,filename=/dev/urandom,id=rng0 -device virtio-rng-pci,rng=rng0 -device qemu-xhci -device usb-tablet -smp 1 -enable-kvm -no-shutdown -vnc :102,share=force-shared -device virtio-serial -chardev pipe,id=virtio_console,path=virtio_console,logfile=virtio_console.log,logappend=on -device virtconsole,chardev=virtio_console,name=org.openqa.console.virtio_console -chardev pipe,id=virtio_console_user,path=virtio_console_user,logfile=virtio_console_user.log,logappend=on -device virtconsole,chardev=virtio_console_user,name=org.openqa.console.virtio_console_user -chardev socket,path=qmp_socket,server=on,wait=off,id=qmp_socket,logfile=qmp_socket.log,logappend=on -qmp chardev:qmp_socket -S -device virtio-scsi-pci,id=scsi0 -blockdev driver=file,node-name=hd0-file,filename=/var/lib/openqa/pool/12/raid/hd0,cache.no-flush=on -blockdev driver=qcow2,node-name=hd0,file=hd0-file,cache.no-flush=on,discard=unmap -device virtio-blk,id=hd0-device,drive=hd0,serial=hd0 -blockdev driver=file,node-name=cd0-overlay0-file,filename=/var/lib/openqa/pool/12/raid/cd0-overlay0,cache.no-flush=on -blockdev driver=qcow2,node-name=cd0-overlay0,file=cd0-overlay0-file,cache.no-flush=on,discard=unmap -device scsi-cd,id=cd0-device,drive=cd0-overlay0,serial=cd0 -drive id=pflash-code-overlay0,if=pflash,file=/var/lib/openqa/pool/12/raid/pflash-code-overlay0,unit=0,readonly=on -drive id=pflash-vars-overlay0,if=pflash,file=/var/lib/openqa/pool/12/raid/pflash-vars-overlay0,unit=1
% date
Mon Nov 25 11:10:00 AM UTC 2024

The process was started at 22:28, so it must be from this test:
https://openqa.suse.de/tests/15995445/logfile?filename=autoinst-log.txt#line-201

I stopped the worker instance for now:

systemctl stop openqa-worker-auto-restart@12.service

So I think the urgency can be reduced.

Actions #4

Updated by tinita about 2 months ago

Stopping the instance apparently killed the process, so I started the instance again.

Actions #5

Updated by jbaier_cz about 2 months ago

  • Related to action #166025: qemu process hanging indefinitely blocking the corresponding VNC port size:S added
Actions #6

Updated by tinita about 2 months ago

I restarted all affected jobs.

Actions #7

Updated by mkittler about 2 months ago

  • Subject changed from "Failed to find an available port: Address already in use", produces incomplete jobs on OSD - most likely related to worker39 to "Failed to find an available port: Address already in use", produces incomplete jobs on OSD - most likely related to worker39 size:M
  • Description updated (diff)
  • Status changed from New to Workable
Actions #8

Updated by okurz about 2 months ago

  • Priority changed from High to Normal
Actions #9

Updated by nicksinger about 1 month ago

There was another occurrence of this today on worker39, reported by Martin via Slack. The test failed similarly to the previous report:

[2024-12-10T00:49:11.761843Z] [debug] [pid:121552] starting: /usr/bin/qemu-system-x86_64 -device VGA,edid=on,xres=1024,yres=768 -only-migratable -chardev ringbuf,id=serial0,logfile=serial0,logappend=on -serial chardev:serial0 -audiodev none,id=snd0 -device intel-hda -device hda-output,audiodev=snd0 -global isa-fdc.fdtypeA=none -m 1024 -cpu qemu64 -netdev user,id=qanet0 -device virtio-net,netdev=qanet0,mac=52:54:00:12:34:56 -object rng-random,filename=/dev/urandom,id=rng0 -device virtio-rng-pci,rng=rng0 -boot order=c -device qemu-xhci -device usb-tablet -smp 16,threads=2 -enable-kvm -no-shutdown -vnc :103,share=force-shared -device virtio-serial -chardev pipe,id=virtio_console,path=virtio_console,logfile=virtio_console.log,logappend=on -device virtconsole,chardev=virtio_console,name=org.openqa.console.virtio_console -chardev pipe,id=virtio_console_user,path=virtio_console_user,logfile=virtio_console_user.log,logappend=on -device virtconsole,chardev=virtio_console_user,name=org.openqa.console.virtio_console_user -chardev socket,path=qmp_socket,server=on,wait=off,id=qmp_socket,logfile=qmp_socket.log,logappend=on -qmp chardev:qmp_socket -S -device virtio-scsi-pci,id=scsi0 -blockdev driver=file,node-name=hd0-overlay0-file,filename=/var/lib/openqa/pool/13/raid/hd0-overlay0,cache.no-flush=on -blockdev driver=qcow2,node-name=hd0-overlay0,file=hd0-overlay0-file,cache.no-flush=on,discard=unmap -device virtio-blk,id=hd0-device,drive=hd0-overlay0,bootindex=0,serial=hd0
[2024-12-10T00:49:11.768614Z] [debug] [pid:121552] Waiting for 0 attempts
dmesg: read kernel buffer failed: Operation not permitted
[2024-12-10T00:49:12.073274Z] [debug] [pid:121552] Waiting for 1 attempts
[2024-12-10T00:49:12.073470Z] [info] [pid:121552] ::: backend::baseclass::die_handler: Backend process died, backend errors are reported below in the following lines:
  QEMU terminated before QMP connection could be established. Check for errors below
[2024-12-10T00:49:12.073805Z] [info] [pid:121552] ::: OpenQA::Qemu::Proc::save_state: Saving QEMU state to qemu_state.json
[2024-12-10T00:49:12.074785Z] [debug] [pid:121552] Passing remaining frames to the video encoder
[2024-12-10T00:49:12.078140Z] [debug] [pid:121552] Waiting for video encoder to finalize the video
[2024-12-10T00:49:12.078216Z] [debug] [pid:121552] The built-in video encoder (pid 121742) terminated
[2024-12-10T00:49:12.079343Z] [debug] [pid:121552] QEMU: QEMU emulator version 8.2.7 (SUSE Linux Enterprise 15)
[2024-12-10T00:49:12.079413Z] [debug] [pid:121552] QEMU: Copyright (c) 2003-2023 Fabrice Bellard and the QEMU Project developers
[2024-12-10T00:49:12.079491Z] [warn] [pid:121552] !!! : qemu-system-x86_64: warning: This family of AMD CPU doesn't support hyperthreading(2)
[2024-12-10T00:49:12.079543Z] [debug] [pid:121552] QEMU: Please configure -smp options properly or try enabling topoext feature.
[2024-12-10T00:49:12.079582Z] [warn] [pid:121552] !!! : qemu-system-x86_64: -vnc :103,share=force-shared: Failed to find an available port: Address already in use
[2024-12-10T00:49:12.080349Z] [debug] [pid:121552] sending magic and exit
[2024-12-10T00:49:12.080650Z] [debug] [pid:120871] received magic close
[2024-12-10T00:49:12.089917Z] [debug] [pid:120871] backend process exited: 0
[2024-12-10T00:49:12.190817Z] [warn] [pid:120871] !!! main: failed to start VM at /usr/lib/os-autoinst/backend/driver.pm line 104.
    backend::driver::start_vm(backend::driver=HASH(0x55b9f6c2f468)) called at /usr/lib/os-autoinst/OpenQA/Isotovideo/Backend.pm line 18
    OpenQA::Isotovideo::Backend::new("OpenQA::Isotovideo::Backend") called at /usr/lib/os-autoinst/OpenQA/Isotovideo/Runner.pm line 109
    OpenQA::Isotovideo::Runner::create_backend(OpenQA::Isotovideo::Runner=HASH(0x55b9edbfa680)) called at /usr/lib/os-autoinst/OpenQA/Isotovideo/Runner.pm line 251
    OpenQA::Isotovideo::Runner::init(OpenQA::Isotovideo::Runner=HASH(0x55b9edbfa680)) called at /usr/bin/isotovideo line 183
    eval {...} called at /usr/bin/isotovideo line 178

[2024-12-10T00:49:12.191180Z] [debug] [pid:120871] stopping command server 121330 because test execution ended through exception
[2024-12-10T00:49:12.391590Z] [debug] [pid:120871] done with command server
[2024-12-10T00:49:12.391689Z] [debug] [pid:120871] stopping autotest process 121362
[2024-12-10T00:49:12.592074Z] [debug] [pid:120871] done with autotest process
120871: EXIT 1
[2024-12-10T00:49:12.660935Z] [info] [pid:4058] Isotovideo exit status: 1
[2024-12-10T00:49:12.693330Z] [info] [pid:4058] +++ worker notes +++
[2024-12-10T00:49:12.693534Z] [info] [pid:4058] End time: 2024-12-10 00:49:12
[2024-12-10T00:49:12.693648Z] [info] [pid:4058] Result: died
[2024-12-10T00:49:12.701384Z] [info] [pid:121845] Uploading autoinst-log.txt

There are some more details in the Slack thread, but I didn't follow it closely and don't completely understand what the "cause" of this was.

Actions #10

Updated by gpathak about 1 month ago · Edited

As @tinita suggested, I tried to find the job that actually started the leftover qemu process.
I found that on worker39:13 the job https://openqa.suse.de/tests/16150344/ was cancelled, and the logs (https://openqa.suse.de/tests/16150344/logfile?filename=autoinst-log.txt#line-8085) show that QEMU didn't get signal 15 from isotovideo.
So this was the first job on worker39:13 which didn't shut down the QEMU session gracefully and caused the "Address already in use" error on further retries.

Marius mentioned that we already have an open ticket for the scenario when QEMU sometimes doesn't exit gracefully.
Maybe this is also related: #136238

Actions #11

Updated by tinita about 1 month ago

I found that both jobs which caused a hanging qemu process have this in common, with a tiny difference:
https://openqa.suse.de/tests/16150344/logfile?filename=autoinst-log.txt#line-8091

Fatal glibc error: malloc.c:4189 (_int_malloc): assertion failed: chunk_main_arena (fwd)

https://openqa.suse.de/tests/15995445/logfile?filename=autoinst-log.txt#line-5639

Fatal glibc error: malloc.c:4172 (_int_malloc): assertion failed: chunk_main_arena (bck->bk)    

In both cases we have ::: backend::driver::_collect_orphan: Driver backend collected unknown process after that, but that can't be the qemu process, because we know it was still running, causing the actual problem.

Actions #12

Updated by nicksinger about 1 month ago

This time we have something on openqaworker18. The last good job was:

What I noticed while trying to understand why 16219186 ended up incomplete:

[2024-12-17T19:00:35.155667Z] [info] [pid:38979] Preparing cgroup to start isotovideo
[2024-12-17T19:00:35.156204Z] [warn] [pid:38979] Disabling cgroup usage because cgroup creation failed: mkdir /sys/fs/cgroup/systemd: Permission denied at /usr/lib/perl5/vendor_perl/5.26.1/Mojo/File.pm line 85.
[2024-12-17T19:00:35.156268Z] [info] [pid:38979] You can define a custom slice with OPENQA_CGROUP_SLICE or indicating the base mount with MOJO_CGROUP_FS.

But I'm not sure whether this is related or not.

Actions #13

Updated by nicksinger about 1 month ago

Another broken instance on worker18: https://openqa.suse.de/tests/16236652 - I might reboot worker18 if nobody wants to look into it right now (asked in Slack).

Actions #14

Updated by okurz 27 days ago

  • Subject changed from "Failed to find an available port: Address already in use", produces incomplete jobs on OSD - most likely related to worker39 size:M to [sporadic] auto_review:"Failed to find an available port: Address already in use":retry, produces incomplete jobs on OSD, multiple machines size:M
  • Priority changed from Normal to High

This also happened in https://gitlab.suse.de/openqa/scripts-ci/-/jobs/3573426#L52 and there is no automatic retrying yet, so we should handle this with higher priority. I also added an "auto_review" expression.

Actions #15

Updated by gpuliti 27 days ago

  • Description updated (diff)
  • Assignee set to gpuliti
Actions #16

Updated by gpuliti 26 days ago

  • Status changed from Workable to In Progress
Actions #17

Updated by gpuliti 26 days ago

  • Status changed from In Progress to Workable
  • Assignee deleted (gpuliti)
Actions #18

Updated by livdywan 13 days ago

  • Status changed from Workable to In Progress
  • Assignee set to livdywan

I'm investigating this atm

Actions #19

Updated by openqa_review 12 days ago

  • Due date set to 2025-01-21

Setting due date based on mean cycle time of SUSE QE Tools

Actions #20

Updated by tinita 12 days ago · Edited

To search for recent failures like this you can use:

select concat('https://openqa.suse.de/t', id), id, assigned_worker_id, t_started, state, result, test, reason from jobs where reason = 'backend died: QEMU exited unexpectedly, see log for details' order by t_started desc limit 20;

It can theoretically be a different error, so the log still needs to be checked.

And we should kill the qemu process on that worker, so that the next job on this worker instance won't run into the same issue.
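
A hedged sketch of how that cleanup could be automated (not an existing script; it assumes the leftover qemu still has the worker's pool directory as its working directory, which the relative file names in its command line suggest, and it needs to run as root):

use strict;
use warnings;

# Hypothetical helper: terminate a leftover qemu belonging to a given worker slot.
my $slot = shift // die "usage: $0 <worker-slot-number>\n";
my $pool = "/var/lib/openqa/pool/$slot";

for my $link (glob '/proc/[0-9]*/cwd') {
    my ($pid) = $link =~ m{/proc/(\d+)/cwd};
    my $cwd = readlink $link // next;              # skip processes we cannot inspect
    next unless $cwd eq $pool;
    open my $fh, '<', "/proc/$pid/cmdline" or next;
    my $cmdline = do { local $/; <$fh> };
    next unless defined $cmdline && $cmdline =~ /qemu-system/;
    print "terminating leftover qemu PID $pid (cwd $cwd)\n";
    kill TERM => $pid;
}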

Actions #21

Updated by livdywan 11 days ago

  • Status changed from In Progress to Workable
  • Assignee deleted (livdywan)

I'm afraid there is something more important on my mind.

Actions #22

Updated by mkittler 6 days ago

  • Status changed from Workable to In Progress
  • Assignee set to mkittler
Actions #23

Updated by mkittler 6 days ago · Edited

We saw this again on multiple slots. Example: https://openqa.suse.de/admin/workers/3377

Notes from my investigation:

Normally the worker slot should at least show up as broken but here it doesn't (it is idling).

systemctl shows the leftover process quite clearly:

openqa-worker-auto-restart@23.service - openQA Worker #23
     Loaded: loaded (/usr/lib/systemd/system/openqa-worker-auto-restart@.service; enabled; preset: disabled)
    Drop-In: /etc/systemd/system/openqa-worker-auto-restart@.service.d
             └─20-nvme-autoformat.conf, 30-openqa-max-inactive-caching-downloads.conf
     Active: active (running) since Sun 2025-01-12 05:18:31 UTC; 1 day 9h ago
    Process: 15228 ExecStartPre=/usr/bin/install -d -m 0755 -o _openqa-worker /var/lib/openqa/pool/23 (code=exited, status=0/SUCCESS)
   Main PID: 15237 (worker)
      Tasks: 7 (limit: 19660)
        CPU: 7h 31min 49.917s
     CGroup: /openqa.slice/openqa-worker.slice/openqa-worker-auto-restart@23.service
             ├─15237 /usr/bin/perl /usr/share/openqa/script/worker --instance 23
             └─69998 /usr/bin/qemu-system-x86_64 -device virtio-vga,edid=on,xres=1024,yres=768 -only-migratable -chardev ringbuf,id=serial0,logfile=serial0,logappend=on -serial chardev:serial0 -audiodev none,id=snd0 -devi>

The leftover process can just be terminated, e.g. sudo kill 69998.

Link to job that caused the leftover process: https://openqa.suse.de/tests/16427146

Relevant worker logs:

Jan 13 14:07:12 qesapworker-prg5 worker[73422]: [info] [pid:73422] Uploading serial_terminal.txt
Jan 13 14:07:12 qesapworker-prg5 worker[73422]: [debug] [pid:73422] Uploading artefact serial_terminal.txt
Jan 13 14:07:12 qesapworker-prg5 worker[73422]: [info] [pid:73422] Uploading serial_terminal_user.txt
Jan 13 14:07:12 qesapworker-prg5 worker[73422]: [debug] [pid:73422] Uploading artefact serial_terminal_user.txt
Jan 13 14:07:12 qesapworker-prg5 worker[15237]: [debug] [pid:15237] Job 16427146 stopped as incomplete
Jan 13 14:07:12 qesapworker-prg5 worker[15237]: [debug] [pid:15237] Unable to read result-check_network.json: Can't open file "/var/lib/openqa/pool/23/testresults/result-check_network.json": No such file or directory at />
Jan 13 14:07:12 qesapworker-prg5 worker[15237]:         Mojo::File::slurp(Mojo::File=SCALAR(0x562378437cd8)) called at /usr/share/openqa/script/../lib/OpenQA/Worker/Job.pm line 1135
Jan 13 14:07:12 qesapworker-prg5 worker[15237]:         eval {...} called at /usr/share/openqa/script/../lib/OpenQA/Worker/Job.pm line 1135
Jan 13 14:07:12 qesapworker-prg5 worker[15237]:         OpenQA::Worker::Job::_read_json_file(OpenQA::Worker::Job=HASH(0x562379180a18), "result-check_network.json") called at /usr/share/openqa/script/../lib/OpenQA/Worker/J>
Jan 13 14:07:12 qesapworker-prg5 worker[15237]:         OpenQA::Worker::Job::_read_module_result(OpenQA::Worker::Job=HASH(0x562379180a18), "check_network") called at /usr/share/openqa/script/../lib/OpenQA/Worker/Job.pm li>
Jan 13 14:07:12 qesapworker-prg5 worker[15237]:         OpenQA::Worker::Job::_read_result_file(OpenQA::Worker::Job=HASH(0x562379180a18), "", ARRAY(0x5623791c2740)) called at /usr/share/openqa/script/../lib/OpenQA/Worker/J>
Jan 13 14:07:12 qesapworker-prg5 worker[15237]:         OpenQA::Worker::Job::_upload_results_step_0_prepare(OpenQA::Worker::Job=HASH(0x562379180a18), CODE(0x5623783eb540)) called at /usr/share/openqa/script/../lib/OpenQA/>
Jan 13 14:07:12 qesapworker-prg5 worker[15237]:         OpenQA::Worker::Job::_upload_results(OpenQA::Worker::Job=HASH(0x562379180a18), CODE(0x5623783eb540)) called at /usr/share/openqa/script/../lib/OpenQA/Worker/Job.pm l>
Jan 13 14:07:12 qesapworker-prg5 worker[15237]:         OpenQA::Worker::Job::_stop_step_5_2_upload(OpenQA::Worker::Job=HASH(0x562379180a18), "died", CODE(0x56237930eda8)) called at /usr/share/openqa/script/../lib/OpenQA/W>
Jan 13 14:07:12 qesapworker-prg5 worker[15237]:         OpenQA::Worker::Job::__ANON__() called at /usr/share/openqa/script/../lib/OpenQA/Worker/Job.pm line 120
Jan 13 14:07:12 qesapworker-prg5 worker[15237]:         OpenQA::Worker::Job::_invoke_after_result_upload(OpenQA::Worker::Job=HASH(0x562379180a18), CODE(0x562379312558)) called at /usr/share/openqa/script/../lib/OpenQA/Wor>
Jan 13 14:07:12 qesapworker-prg5 worker[15237]:         OpenQA::Worker::Job::_stop_step_5_1_upload(OpenQA::Worker::Job=HASH(0x562379180a18), "died", CODE(0x56237930eda8)) called at /usr/share/openqa/script/../lib/OpenQA/W>
Jan 13 14:07:12 qesapworker-prg5 worker[15237]:         OpenQA::Worker::Job::__ANON__(Mojo::IOLoop::Subprocess=HASH(0x56237930b258), "", "died") called at /usr/lib/perl5/vendor_perl/5.26.1/Mojo/IOLoop/Subprocess.pm line 85
Jan 13 14:07:12 qesapworker-prg5 worker[15237]:         Mojo::IOLoop::Subprocess::__ANON__(Mojo::IOLoop::Stream=HASH(0x56237930f270)) called at /usr/lib/perl5/vendor_perl/5.26.1/Mojo/EventEmitter.pm line 15
Jan 13 14:07:12 qesapworker-prg5 worker[15237]:         Mojo::EventEmitter::emit(Mojo::IOLoop::Stream=HASH(0x56237930f270), "close") called at /usr/lib/perl5/vendor_perl/5.26.1/Mojo/IOLoop/Stream.pm line 27
Jan 13 14:07:12 qesapworker-prg5 worker[15237]:         Mojo::IOLoop::Stream::close(Mojo::IOLoop::Stream=HASH(0x56237930f270)) called at /usr/lib/perl5/vendor_perl/5.26.1/Mojo/IOLoop/Stream.pm line 109
Jan 13 14:07:12 qesapworker-prg5 worker[15237]:         Mojo::IOLoop::Stream::_read(Mojo::IOLoop::Stream=HASH(0x56237930f270)) called at /usr/lib/perl5/vendor_perl/5.26.1/Mojo/IOLoop/Stream.pm line 57
Jan 13 14:07:12 qesapworker-prg5 worker[15237]:         Mojo::IOLoop::Stream::__ANON__(Mojo::Reactor::Poll=HASH(0x56237222f690)) called at /usr/lib/perl5/vendor_perl/5.26.1/Mojo/Reactor/Poll.pm line 141
Jan 13 14:07:12 qesapworker-prg5 worker[15237]:         eval {...} called at /usr/lib/perl5/vendor_perl/5.26.1/Mojo/Reactor/Poll.pm line 141
Jan 13 14:07:12 qesapworker-prg5 worker[15237]:         Mojo::Reactor::Poll::_try(Mojo::Reactor::Poll=HASH(0x56237222f690), "I/O watcher", CODE(0x56237930d178), 0) called at /usr/lib/perl5/vendor_perl/5.26.1/Mojo/Reactor/>
Jan 13 14:07:12 qesapworker-prg5 worker[15237]:         Mojo::Reactor::Poll::one_tick(Mojo::Reactor::Poll=HASH(0x56237222f690)) called at /usr/lib/perl5/vendor_perl/5.26.1/Mojo/Reactor/Poll.pm line 101
Jan 13 14:07:12 qesapworker-prg5 worker[15237]:         Mojo::Reactor::Poll::start(Mojo::Reactor::Poll=HASH(0x56237222f690)) called at /usr/lib/perl5/vendor_perl/5.26.1/Mojo/IOLoop.pm line 134
Jan 13 14:07:12 qesapworker-prg5 worker[15237]:         Mojo::IOLoop::start(Mojo::IOLoop=HASH(0x562373b66ac8)) called at /usr/share/openqa/script/../lib/OpenQA/Worker.pm line 369
Jan 13 14:07:12 qesapworker-prg5 worker[15237]:         OpenQA::Worker::exec(OpenQA::Worker=HASH(0x562375602398)) called at /usr/share/openqa/script/worker line 120
Jan 13 14:07:12 qesapworker-prg5 worker[15237]: [debug] [pid:15237] REST-API call: POST "http://openqa.suse.de/api/v1/jobs/16427146/status"
Jan 13 14:07:12 qesapworker-prg5 worker[73425]: [debug] [pid:73425] Optimizing /var/lib/openqa/pool/23/testresults/glibc_locale-8.png
Jan 13 14:07:12 qesapworker-prg5 worker[73425]: [debug] [pid:73425] Uploading artefact glibc_locale-8.png as c62e2548b2e63f5b68ffe0f91e6bc96b
Jan 13 14:07:12 qesapworker-prg5 worker[73425]: [debug] [pid:73425] Optimizing /var/lib/openqa/pool/23/testresults/.thumbs/glibc_locale-8.png
Jan 13 14:07:12 qesapworker-prg5 worker[73425]: [debug] [pid:73425] Uploading artefact glibc_locale-8.png as c62e2548b2e63f5b68ffe0f91e6bc96b
Jan 13 14:07:12 qesapworker-prg5 worker[73425]: [debug] [pid:73425] Optimizing /var/lib/openqa/pool/23/testresults/glibc_locale-1.png
Jan 13 14:07:12 qesapworker-prg5 worker[73425]: [debug] [pid:73425] Uploading artefact glibc_locale-1.png as d804d6cb366b2158b5210b343401c366
Jan 13 14:07:12 qesapworker-prg5 worker[73425]: [debug] [pid:73425] Optimizing /var/lib/openqa/pool/23/testresults/.thumbs/glibc_locale-1.png
Jan 13 14:07:12 qesapworker-prg5 worker[73425]: [debug] [pid:73425] Uploading artefact glibc_locale-1.png as d804d6cb366b2158b5210b343401c366
…
Jan 13 14:07:13 qesapworker-prg5 worker[73425]: [debug] [pid:73425] Optimizing /var/lib/openqa/pool/23/testresults/.thumbs/glibc_locale-3.png
Jan 13 14:07:13 qesapworker-prg5 worker[73425]: [debug] [pid:73425] Uploading artefact glibc_locale-3.png as 9fb8d6a2417669f089c1ce447b73e1ff
Jan 13 14:07:13 qesapworker-prg5 worker[73425]: [debug] [pid:73425] Uploading artefact glibc_locale-9.txt
Jan 13 14:07:13 qesapworker-prg5 worker[73425]: [debug] [pid:73425] Uploading artefact glibc_locale-21.txt
Jan 13 14:07:13 qesapworker-prg5 worker[73425]: [debug] [pid:73425] Uploading artefact glibc_locale-20.txt
Jan 13 14:07:13 qesapworker-prg5 worker[73425]: [debug] [pid:73425] Uploading artefact glibc_locale-25.txt
Jan 13 14:07:13 qesapworker-prg5 worker[73425]: [debug] [pid:73425] Uploading artefact glibc_locale-24.txt
Jan 13 14:07:13 qesapworker-prg5 worker[73425]: [debug] [pid:73425] Uploading artefact glibc_locale-23.txt
Jan 13 14:07:13 qesapworker-prg5 worker[73425]: [debug] [pid:73425] Uploading artefact glibc_locale-7.txt
Jan 13 14:07:13 qesapworker-prg5 worker[73425]: [debug] [pid:73425] Uploading artefact glibc_locale-27.txt
Jan 13 14:07:13 qesapworker-prg5 worker[73425]: [debug] [pid:73425] Uploading artefact glibc_locale-18.txt
Jan 13 14:07:13 qesapworker-prg5 worker[73425]: [debug] [pid:73425] Uploading artefact glibc_locale-16.txt
Jan 13 14:07:13 qesapworker-prg5 worker[73425]: [debug] [pid:73425] Uploading artefact glibc_locale-19.txt
Jan 13 14:07:13 qesapworker-prg5 worker[73425]: [debug] [pid:73425] Uploading artefact glibc_locale-22.txt
Jan 13 14:07:13 qesapworker-prg5 worker[15237]: [debug] [pid:15237] Upload concluded (at glibc_locale)
Jan 13 14:07:13 qesapworker-prg5 worker[15237]: [debug] [pid:15237] REST-API call: POST "http://openqa.suse.de/api/v1/jobs/16427146/set_done?reason=died%3A+terminated+prematurely%2C+see+log+output+for+details&result=incom>
Jan 13 14:07:13 qesapworker-prg5 worker[15237]: [debug] [pid:15237] Job 16427146 from openqa.suse.de finished - reason: died
Jan 13 14:07:13 qesapworker-prg5 worker[15237]: [debug] [pid:15237] Cleaning up for next job
Jan 13 14:08:59 qesapworker-prg5 worker[15237]: [debug] [pid:15237] Accepting job 16427266 from openqa.suse.de.
Jan 13 14:08:59 qesapworker-prg5 worker[15237]: [debug] [pid:15237] Setting job 16427266 from openqa.suse.de up
Jan 13 14:08:59 qesapworker-prg5 worker[15237]: [debug] [pid:15237] Preparing Mojo::IOLoop::ReadWriteProcess::Session
Jan 13 14:08:59 qesapworker-prg5 worker[15237]: [info] [pid:15237] +++ setup notes +++
Jan 13 14:08:59 qesapworker-prg5 worker[15237]: [info] [pid:15237] Running on qesapworker-prg5:23 (Linux 6.4.0-150600.23.30-default #1 SMP PREEMPT_DYNAMIC Sat Dec  7 08:37:53 UTC 2024 (8c25a0a) x86_64)
Jan 13 14:08:59 qesapworker-prg5 worker[15237]: [debug] [pid:15237] Job settings:
…

Relevant os-autoinst logs:

[2025-01-13T14:07:03.162048Z] [debug] [pid:69720] no change: 86.6s
[2025-01-13T14:07:03.510529Z] [debug] [pid:69720] no match: 86.6s, best candidate: grub2-20241114 (0.18)
[2025-01-13T14:07:07.026919Z] [warn] [pid:69720] !!! backend::baseclass::check_asserted_screen: check_asserted_screen took 2.86 seconds for 54 candidate needles - make your needles more specific
[2025-01-13T14:07:07.027016Z] [debug] [pid:69720] no match: 85.6s, best candidate: grub2-20241114 (0.18)
[2025-01-13T14:07:07.028906Z] [debug] [pid:69720] considering VNC stalled, no update for 4.52 seconds
[2025-01-13T14:07:07.038750Z] [debug] [pid:69720] no change: 82.7s
[2025-01-13T14:07:07.378976Z] [debug] [pid:69720] no match: 82.7s, best candidate: grub2-20241114 (0.18)
[2025-01-13T14:07:07.379278Z] [debug] [pid:69720] pointer type 1 0 1024 768 -257
[2025-01-13T14:07:07.379322Z] [debug] [pid:69720] led state 0 1 1 -261
corrupted size vs. prev_size
Error while decoding stream #0:0: Invalid data found when processing input
[2025-01-13T14:07:11.182059Z] [debug] [pid:69511] backend process exited: 0
[2025-01-13T14:07:11.183220Z] [debug] [pid:69511] isotovideo handle_shutdown
[2025-01-13T14:07:11.183480Z] [debug] [pid:69511] stopping command server 69521 because test execution ended
[2025-01-13T14:07:11.183704Z] [debug] [pid:69511] isotovideo: informing websocket clients before stopping command server: http://127.0.0.1:20233/wxFqLKHC9jtwtMlN/broadcast
[2025-01-13T14:07:11.184467Z] [info] [pid:69511] ::: backend::driver::_collect_orphan: Driver backend collected unknown process with pid 69913 and exit status: 0
[2025-01-13T14:07:11.205481Z] [debug] [pid:69511] commands process exited: 0
[2025-01-13T14:07:11.305912Z] [debug] [pid:69511] done with command server
[2025-01-13T14:07:11.306221Z] [debug] [pid:69511] stopping autotest process 69530
[2025-01-13T14:07:11.306715Z] [debug] [pid:69530] autotest received signal TERM, saving results of current test before exiting
[2025-01-13T14:07:11.325243Z] [debug] [pid:69511] [autotest] process exited: 1
[2025-01-13T14:07:11.425890Z] [debug] [pid:69511] done with autotest process
[2025-01-13T14:07:11.426033Z] [debug] [pid:69511] isotovideo failed
[2025-01-13T14:07:11.428938Z] [debug] [pid:69511] stopping backend process 69720
[2025-01-13T14:07:11.429133Z] [debug] [pid:69511] done with backend process
69511: EXIT 1
[2025-01-13T14:07:11.521543Z] [info] [pid:15237] Isotovideo exit status: 1
[2025-01-13T14:07:11.546777Z] [info] [pid:15237] +++ worker notes +++
[2025-01-13T14:07:11.547021Z] [info] [pid:15237] End time: 2025-01-13 14:07:11
[2025-01-13T14:07:11.547142Z] [info] [pid:15237] Result: died
[2025-01-13T14:07:11.554374Z] [info] [pid:73422] Uploading image_info-basic_health_check.txt
[2025-01-13T14:07:11.581017Z] [info] [pid:73422] Uploading image_info-dmesg.txt
[2025-01-13T14:07:11.601762Z] [info] [pid:73422] Uploading image_info-ip-addr-show.txt
[2025-01-13T14:07:11.615916Z] [info] [pid:73422] Uploading image_info-journal.txt
[2025-01-13T14:07:11.681871Z] [info] [pid:73422] Uploading image_info-loadavg.txt
[2025-01-13T14:07:11.702895Z] [info] [pid:73422] Uploading image_info-problem_detection_logs.tar.xz
frame= 1353 fps=1.4 q=0.0 Lsize=    2068kB time=00:00:56.33 bitrate= 300.7kbits/s speed=0.059x    
video:2058kB audio:0kB subtitle:0kB other streams:0kB global headers:0kB muxing overhead: 0.462934%
[2025-01-13T14:07:11.726346Z] [info] [pid:73422] Uploading image_info-psaxf.txt
[2025-01-13T14:07:11.743446Z] [info] [pid:73422] Uploading image_info-resolv-conf.txt
[2025-01-13T14:07:11.760713Z] [info] [pid:73422] Uploading image_info-sysconfig.tar.gz
[2025-01-13T14:07:11.780262Z] [info] [pid:73422] Uploading image_info-systemctl.txt
[2025-01-13T14:07:11.805747Z] [info] [pid:73422] Uploading image_info-systemctl_status.txt
[2025-01-13T14:07:11.830682Z] [info] [pid:73422] Uploading image_info-systemctl_unit-files.txt
[2025-01-13T14:07:11.850402Z] [info] [pid:73422] Uploading journal_check-audit.tar
[2025-01-13T14:07:11.870232Z] [info] [pid:73422] Uploading journal_check-full_journal.log
[2025-01-13T14:07:11.933653Z] [info] [pid:73422] Uploading pam-results.tap
[2025-01-13T14:07:11.961788Z] [info] [pid:73422] Uploading video.webm
[2025-01-13T14:07:12.037806Z] [info] [pid:73422] Uploading autoinst-log.txt
Actions #24

Updated by mkittler 5 days ago

This needs to be fixed in RWP (unless we want to inherit from its package and reimplement is_running and stop on our side): https://github.com/openSUSE/Mojo-IOLoop-ReadWriteProcess/pull/61

I also came up with a small improvement for openQA: https://github.com/os-autoinst/openQA/pull/6110
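
Just to illustrate the subclassing alternative mentioned above, a sketch only (not what the linked PRs implement; it assumes RWP's pid accessor and that the child was started as a process group leader, see the setpgrp discussion further down):

package OpenQA::Isotovideo::Process;    # hypothetical class name
use Mojo::Base 'Mojo::IOLoop::ReadWriteProcess';

# Override stop() to also clean up the child's whole process group (e.g. a qemu
# that outlived the direct child) after the regular RWP termination logic ran.
sub stop {
    my ($self) = @_;
    my $pid = $self->pid;
    $self->SUPER::stop;
    return $self unless defined $pid && kill 0 => -$pid;   # group already gone
    kill TERM => -$pid;
    sleep 1;
    kill KILL => -$pid if kill 0 => -$pid;                  # escalate if members survived
    return $self;
}

1;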

Actions #25

Updated by mkittler 5 days ago · Edited

  • Status changed from In Progress to Feedback

Both PRs have been merged.

I also created https://build.opensuse.org/package/show/devel:openQA/perl-Mojo-IOLoop-ReadWriteProcess and deployed it on OSD workers via sudo salt -C 'G@roles:worker' cmd.run 'zypper in --allow-vendor-change perl-Mojo-IOLoop-ReadWriteProcess'. Judging by the output of sudo salt -C 'G@roles:worker' cmd.run "grep -- '-\$pid' /usr/lib/perl5/vendor_perl/5.26.1/Mojo/IOLoop/ReadWriteProcess.pm" the new code is in place.

Note that the package hasn't been published for aarch64 yet, so if we still see the issue on that architecture, that wouldn't necessarily mean that the fix didn't work.

So let's see whether this fixes the issue.

Actions #26

Updated by mkittler 4 days ago

We had to revert changes due to a race condition.

This change will fix it: https://github.com/openSUSE/Mojo-IOLoop-ReadWriteProcess/pull/62

I'll continue deploying it later but only on a single worker.

I'll also try to improve tests in Mojo-RWP.

Actions #27

Updated by okurz 4 days ago

  • Related to action #175464: jobs incomplete with auto_review:"setup failure: isotovideo can not be started" added
Actions #28

Updated by mkittler 4 days ago · Edited

This is also a bit problematic when it comes to stopping the process: if we call stop before setpgrp(0, 0) runs (or setpgrp(0, 0) does not run at all for some reason), the process cannot be stopped. We always had this problem (also before my recent changes), so supposedly it doesn't really matter in practice. However, here's a draft for how to fix this: https://github.com/openSUSE/Mojo-IOLoop-ReadWriteProcess/pull/64
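
To make the race more concrete, a minimal sketch (not what the linked PR does) of one way to avoid it: the child confirms over a pipe that setpgrp(0, 0) has happened before the parent relies on group signalling, and the parent falls back to signalling the single PID otherwise:

use strict;
use warnings;

pipe my $reader, my $writer or die "pipe failed: $!";
my $pid = fork // die "fork failed: $!";
if ($pid == 0) {                   # child
    close $reader;
    setpgrp(0, 0);                 # become process group leader first ...
    syswrite $writer, '1';         # ... then tell the parent the group exists
    close $writer;
    exec 'isotovideo', @ARGV or die "exec failed: $!";
}
close $writer;                     # parent
my $group_ready = sysread($reader, my $buf, 1);   # blocks until setpgrp ran (or the child died)
close $reader;

# When stopping: prefer the whole group, fall back to the single PID if the
# group never came into existence (the situation described above).
if ($group_ready) { kill TERM => -$pid } else { kill TERM => $pid }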

Actions #29

Updated by okurz 4 days ago

  • Parent task set to #175515
Actions #30

Updated by mkittler 4 days ago

A version of RWP that I will deploy tomorrow on a single production host to see whether my latest changes work in production: https://build.opensuse.org/package/show/home:mkittler:branches:devel:languages:perl/perl-Mojo-IOLoop-ReadWriteProcess

Actions #31

Updated by mkittler 3 days ago

As mentioned on Slack:

I now deployed a fixed version of the change on openqaworker14.qa.suse.cz because that was one of the previously affected machines. If you see any issues, simply disable the repository home_mkittler_test_poo170209 I added there and install the regular version of RWP again via zypper in --allow-vendor-change perl-Mojo-IOLoop-ReadWriteProcess. (Of course I'll handle that, just in case I'm not around.)

Actions #32

Updated by mkittler 2 days ago

So far the situation looks good on openqaworker14:

openqa=> select id, t_finished, result, reason from jobs where (select host from workers where id = assigned_worker_id) = 'openqaworker14' and result = 'incomplete' and t_finished >= '2025-01-16T12:00:00' order by t_finished;
    id    |     t_finished      |   result   |                           reason                            
----------+---------------------+------------+-------------------------------------------------------------
 16479164 | 2025-01-16 14:01:27 | incomplete | backend died: QEMU exited unexpectedly, see log for details
 16479166 | 2025-01-16 14:02:29 | incomplete | backend died: QEMU exited unexpectedly, see log for details
 16479171 | 2025-01-16 14:03:26 | incomplete | backend died: QEMU exited unexpectedly, see log for details
 16479176 | 2025-01-16 14:04:46 | incomplete | backend died: QEMU exited unexpectedly, see log for details
 16479177 | 2025-01-16 14:05:50 | incomplete | backend died: QEMU exited unexpectedly, see log for details
 16479178 | 2025-01-16 14:06:53 | incomplete | backend died: QEMU exited unexpectedly, see log for details
 16479179 | 2025-01-16 14:07:57 | incomplete | backend died: QEMU exited unexpectedly, see log for details
(7 rows)

These incompletes were caused by a leftover process on slot 14 which I have only cleaned up now. The leftover was from https://openqa.suse.de/tests/16472470, which ran 12 hours ago, i.e. before I deployed my changes.

That means there have been no relevant leftover processes so far and also no regressions. (I also checked failed jobs but couldn't find any relevant ones.)

Actions #33

Updated by rfan1 2 days ago

I hit the issue again: https://openqa.suse.de/tests/16486907

Can you please help double-check?

Actions #34

Updated by mkittler 2 days ago

This slot looks fine. I assume you have already killed the leftover process yourself then.

I ran the query from yesterday again to check on openqaworker14 and it looks good. So I'm going to deploy my change everywhere again.

Actions #35

Updated by mkittler 2 days ago

Considering there are other problems with OSD right now, we decided to deploy the fix everywhere only next week.


Until then, here's how you can work around it:

Check whether there's a leftover process:

martchus@qesapworker-prg6:~> sudo systemctl status openqa-worker-auto-restart@8
● openqa-worker-auto-restart@8.service - openQA Worker #8
     Loaded: loaded (/usr/lib/systemd/system/openqa-worker-auto-restart@.service; enabled; preset: disabled)
    Drop-In: /etc/systemd/system/openqa-worker-auto-restart@.service.d
             └─20-nvme-autoformat.conf, 30-openqa-max-inactive-caching-downloads.conf
     Active: active (running) since Sun 2025-01-12 05:18:28 UTC; 1 day 9h ago
    Process: 21274 ExecStartPre=/usr/bin/install -d -m 0755 -o _openqa-worker /var/lib/openqa/pool/8 (code=exited, status=0/SUCCESS)
   Main PID: 21280 (worker)
      Tasks: 9 (limit: 19660)
        CPU: 1h 45min 11.987s
     CGroup: /openqa.slice/openqa-worker.slice/openqa-worker-auto-restart@8.service
             ├─21280 /usr/bin/perl /usr/share/openqa/script/worker --instance 8
             ├─23645 /usr/bin/qemu-system-x86_64 -device VGA,edid=on,xres=1024,yres=768 -only-migratable -chardev ringbuf,id=serial0,logfile=serial0,logappend=on -serial chardev:serial0 -audiodev none,id=snd0 -device inte>
             └─75353 swtpm socket --tpmstate dir=/tmp/mytpm8 --ctrl type=unixio,path=/tmp/mytpm8/swtpm-sock --log level=20 -d

Jan 13 15:16:50 qesapworker-prg6 worker[21280]:         eval {...} called at /usr/lib/perl5/vendor_perl/5.26.1/Mojo/Reactor/Poll.pm line 141
Jan 13 15:16:50 qesapworker-prg6 worker[21280]:         Mojo::Reactor::Poll::_try(Mojo::Reactor::Poll=HASH(0x555de2562340), "I/O watcher", CODE(0x555de6399db8), 0) called at /usr/lib/perl5/vendor_perl/5.26.1/Mojo/Reactor/>
Jan 13 15:16:50 qesapworker-prg6 worker[21280]:         Mojo::Reactor::Poll::one_tick(Mojo::Reactor::Poll=HASH(0x555de2562340)) called at /usr/lib/perl5/vendor_perl/5.26.1/Mojo/Reactor/Poll.pm line 101
Jan 13 15:16:50 qesapworker-prg6 worker[21280]:         Mojo::Reactor::Poll::start(Mojo::Reactor::Poll=HASH(0x555de2562340)) called at /usr/lib/perl5/vendor_perl/5.26.1/Mojo/IOLoop.pm line 134
Jan 13 15:16:50 qesapworker-prg6 worker[21280]:         Mojo::IOLoop::start(Mojo::IOLoop=HASH(0x555de3e99fe8)) called at /usr/share/openqa/script/../lib/OpenQA/Worker.pm line 369
Jan 13 15:16:50 qesapworker-prg6 worker[21280]:         OpenQA::Worker::exec(OpenQA::Worker=HASH(0x555de5a1a798)) called at /usr/share/openqa/script/worker line 120
Jan 13 15:16:50 qesapworker-prg6 worker[21280]: [debug] [pid:21280] Upload concluded (no current module)
Jan 13 15:16:50 qesapworker-prg6 worker[21280]: [debug] [pid:21280] REST-API call: POST "http://openqa.suse.de/api/v1/jobs/16436185/set_done?reason=backend+died%3A+QEMU+exited+unexpectedly%2C+see+log+for+details&result=in>
Jan 13 15:16:50 qesapworker-prg6 worker[21280]: [debug] [pid:21280] Job 16436185 from openqa.suse.de finished - reason: died
Jan 13 15:16:50 qesapworker-prg6 worker[21280]: [debug] [pid:21280] Cleaning up for next job

Terminate it:

martchus@qesapworker-prg6:~> sudo kill 23645 75353

Run kill again to see if it worked:

martchus@qesapworker-prg6:~> sudo kill 23645 75353
kill: sending signal to 23645 failed: No such process
kill: sending signal to 75353 failed: No such process